Types of Recommendation Systems:
Content-Based Recommendation System:
This system recommends movies to a user based on the movies they have watched before. For example, if a person has watched action movies, it will recommend more action movies to them.
Popularity-Based Recommendation System:
This system recommends the movies that are currently most popular, such as the top titles on a platform like Netflix or in cinemas.
Collaborative Recommendation System:
This system groups people by their viewing patterns. If a user watches a film that a group has watched, the system recommends the other films watched by that group to the user (that is, it recommends based on other users' past data).
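As a rough illustration of the collaborative idea, the sketch below uses a tiny, made-up watch matrix (rows are users, columns are films; the matrix, the film selection, and the helper name `recommend_for` are all hypothetical). It recommends the films watched by the most similar other user. This is only one simple way to realize the idea and is not the approach used in the rest of this notebook, which is content based.

```python
import numpy as np

# Hypothetical watch history: 1 = watched, 0 = not watched.
films = ["Avatar", "Spectre", "John Carter", "Cars"]
watched = np.array([
    [1, 1, 0, 0],  # user 0
    [1, 1, 1, 0],  # user 1
    [0, 0, 0, 1],  # user 2
])

def recommend_for(user, watched, films):
    """Recommend films seen by the most similar other user but not by `user`."""
    norms = np.linalg.norm(watched, axis=1)
    # Cosine similarity between `user` and every user (including themselves).
    sims = watched @ watched[user] / (norms * norms[user])
    sims[user] = -1.0  # never pick the user themselves as the neighbour
    neighbour = int(np.argmax(sims))
    unseen = (watched[neighbour] == 1) & (watched[user] == 0)
    return [film for film, flag in zip(films, unseen) if flag]

recommendations = recommend_for(0, watched, films)
print(recommendations)
```

Here user 1 shares two films with user 0, so user 1 is the closest neighbour and the one film only user 1 has seen is suggested.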
Workflow:
Data Collection:
We need data about the movies (movie description, genre, etc.).
Data Preprocessing:
Clean the data by handling missing or incomplete values.
Feature Extraction:
The data frame contains textual features that cannot be used directly, so we need to convert them into meaningful numerical values.
Find the Similarity:
We have about 5,000 movies and want to find which ones are similar to each other by assigning each pair a similarity score (a similarity confidence score).
User Input:
Ask the user for a movie name; based on this input we suggest movies the user can watch.
Use Cosine Similarity:
Cosine similarity measures how similar two vectors are. We convert each movie into a vector and compute the cosine similarity between these vectors. When the user gives a movie name, we compare that movie with all the others to find the ones most similar to it, which gives us a list of movies to suggest.
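The feature extraction, similarity, and user input steps above can be sketched on toy data as follows. The titles, the tag strings, and the helper `most_similar` are invented for illustration; only `CountVectorizer` and `cosine_similarity` come from the libraries this notebook imports. Treat it as a minimal sketch of the idea, not as the notebook's final implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and tag strings standing in for the real movie features.
titles = ["Space War", "Star Battle", "Love Story"]
tags = [
    "space action alien war",
    "space action alien battle",
    "romance drama love",
]

# Step 1: turn each text into a vector of word counts.
vectors = CountVectorizer().fit_transform(tags)

# Step 2: pairwise cosine similarity, one row and one column per movie.
similarity = cosine_similarity(vectors)

def most_similar(title, top_n=1):
    """Return the top_n titles most similar to `title`, excluding itself."""
    idx = titles.index(title)
    ranked = sorted(enumerate(similarity[idx]), key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in ranked if i != idx][:top_n]

print(most_similar("Space War"))
```

The first two toy movies share three of their four words, so their cosine similarity is 3 / (2 * 2) = 0.75, while the third shares none and scores 0.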
Import Libraries:
import ast
import numpy as np # NumPy is a Python library used for working with arrays.
import pandas as pd # Pandas is mainly used for data analysis. Pandas allows importing data from various file formats such as comma-separated values, JSON, SQL, and Microsoft Excel.
import seaborn as sns # Seaborn is a library in Python predominantly used for making statistical graphics. Seaborn is a data visualization library built on top of matplotlib .
import matplotlib.pyplot as plt # Matplotlib is a cross-platform, data visualization and graphical plotting library for Python and its numerical extension NumPy.
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
# Loading the dataframe
df = pd.read_csv('tmdb_5000_movies.csv')
df_credits = pd.read_csv('tmdb_5000_credits.csv')
df = df.merge(df_credits,on='title')
df.head() # Show the first 5 rows in the data
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | ... | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | movie_id | cast | crew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | ... | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | 19995 | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | 300000000 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | http://disney.go.com/disneypictures/pirates/ | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | ... | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 | 285 | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |
| 2 | 245000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.sonypictures.com/movies/spectre/ | 206647 | [{"id": 470, "name": "spy"}, {"id": 818, "name... | en | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | ... | 148.0 | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 | 206647 | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... |
| 3 | 250000000 | [{"id": 28, "name": "Action"}, {"id": 80, "nam... | http://www.thedarkknightrises.com/ | 49026 | [{"id": 849, "name": "dc comics"}, {"id": 853,... | en | The Dark Knight Rises | Following the death of District Attorney Harve... | 112.312950 | [{"name": "Legendary Pictures", "id": 923}, {"... | ... | 165.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The Legend Ends | The Dark Knight Rises | 7.6 | 9106 | 49026 | [{"cast_id": 2, "character": "Bruce Wayne / Ba... | [{"credit_id": "52fe4781c3a36847f81398c3", "de... |
| 4 | 260000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://movies.disney.com/john-carter | 49529 | [{"id": 818, "name": "based on novel"}, {"id":... | en | John Carter | John Carter is a war-weary, former military ca... | 43.926995 | [{"name": "Walt Disney Pictures", "id": 2}] | ... | 132.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 | 49529 | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... |
5 rows × 23 columns
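Columns such as genres, keywords, cast, and crew hold strings that look like lists of dictionaries, so they need parsing before use. The `ast` module imported earlier offers `ast.literal_eval` for this. The sketch below parses a shortened, illustrative genres value; the helper name `extract_names` is hypothetical, and applying it to the real column (for example with `df['genres'].apply(extract_names)`) is an assumption about how one might proceed.

```python
import ast

# A shortened genres value in the same format as the table above.
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def extract_names(text):
    """Parse a list-of-dicts string and return the 'name' of each entry."""
    return [entry["name"] for entry in ast.literal_eval(text)]

genre_names = extract_names(raw_genres)
print(genre_names)
```

An empty value such as `[]` parses to an empty list, so rows without genres would simply yield no names.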
df_credits.head()
| | movie_id | title | cast | crew |
|---|---|---|---|---|
| 0 | 19995 | Avatar | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | 285 | Pirates of the Caribbean: At World's End | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |
| 2 | 206647 | Spectre | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... |
| 3 | 49026 | The Dark Knight Rises | [{"cast_id": 2, "character": "Bruce Wayne / Ba... | [{"credit_id": "52fe4781c3a36847f81398c3", "de... |
| 4 | 49529 | John Carter | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... |
Describe the data:
df.shape # Show the number of rows and columns as a tuple (number of rows, number of columns).
# The movies file alone has 4803 rows and 20 columns; after the merge the frame has 4809 rows and 23 columns
(4809, 23)
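The row count is higher after the merge than before it. A likely explanation is that some titles occur more than once, so joining on `title` pairs each duplicate with every matching row. The sketch below reproduces the effect on two hypothetical miniature frames and shows that joining on the id columns avoids it. The frames and the title "Title B" are made up; whether to switch the real merge to `id` / `movie_id` is left open here.

```python
import pandas as pd

# Hypothetical miniature versions of the two files; "Title B" appears twice.
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["Title A", "Title B", "Title B"]})
credits = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Title A", "Title B", "Title B"]})

# Joining on a non-unique title pairs every "Title B" row with every other one.
by_title = movies.merge(credits, on="title")

# Joining on the unique ids keeps exactly one row per movie.
by_id = movies.merge(credits, left_on="id", right_on="movie_id", suffixes=("", "_credits"))

print(len(by_title), len(by_id))
```

With two rows sharing a title on each side, the title join produces 1 + 2 * 2 = 5 rows, while the id join keeps 3.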
df.columns # Show name of columns
Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
'original_title', 'overview', 'popularity', 'production_companies',
'production_countries', 'release_date', 'revenue', 'runtime',
'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
'vote_count', 'movie_id', 'cast', 'crew'],
dtype='object')
#===================================== Column Names and its meaning ====================================
# Budget : The budget with which the movie was made.
# Genre : Type of the movie (action, drama, horror, ...).
# Homepage : Link to the movie's official website.
# Id : ID of the film.
# Keyword : The keywords or tags related to the movie; words that give an idea of what the movie is about.
# Original language : The language in which the movie was made.
# Original title : The title of the movie before translation or adaptation.
# Overview : A brief description of the movie.
# Popularity : A numeric quantity specifying the movie popularity.
# Production companies : The production house of the movie.
# Production countries : The country in which it was produced.
# Release date : The date on which it was released.
# Revenue : The worldwide revenue generated by the movie.
# Runtime : The running time of the movie in minutes.
# Status : "Released" or "Rumored".
# Tagline : Movie's tagline.
# Title : Title of the movie.
# Vote average : Average rating the movie received.
# Vote count : Number of votes received.
df.describe() # Computes summary statistics (count, mean, std, min, percentiles, max) for the numeric columns of the DataFrame.
| | budget | id | popularity | revenue | runtime | vote_average | vote_count | movie_id |
|---|---|---|---|---|---|---|---|---|
| count | 4.809000e+03 | 4809.000000 | 4809.000000 | 4.809000e+03 | 4807.000000 | 4809.000000 | 4809.000000 | 4809.000000 |
| mean | 2.902780e+07 | 57120.571429 | 21.491664 | 8.227511e+07 | 106.882255 | 6.092514 | 690.331670 | 57120.571429 |
| std | 4.070473e+07 | 88653.369849 | 31.803366 | 1.628379e+08 | 22.602535 | 1.193989 | 1234.187111 | 88653.369849 |
| min | 0.000000e+00 | 5.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 5.000000 |
| 25% | 7.800000e+05 | 9012.000000 | 4.667230 | 0.000000e+00 | 94.000000 | 5.600000 | 54.000000 | 9012.000000 |
| 50% | 1.500000e+07 | 14624.000000 | 12.921594 | 1.917000e+07 | 103.000000 | 6.200000 | 235.000000 | 14624.000000 |
| 75% | 4.000000e+07 | 58595.000000 | 28.350529 | 9.291317e+07 | 118.000000 | 6.800000 | 737.000000 | 58595.000000 |
| max | 3.800000e+08 | 459488.000000 | 875.581305 | 2.787965e+09 | 338.000000 | 10.000000 | 13752.000000 | 459488.000000 |
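In the summary above, runtime has a count of 4807 while the other columns have 4809, which suggests that two runtime values are missing. The sketch below shows one way to count and fill missing values on a hypothetical miniature frame. The fill strategy (median for numbers, empty string for text) is an assumption, not necessarily what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one missing runtime and one missing tagline.
toy = pd.DataFrame({
    "runtime": [162.0, np.nan, 148.0],
    "tagline": ["Enter the World of Pandora.", "A Plan No One Escapes", None],
})

# Count missing values per column.
missing = toy.isnull().sum()

# One possible treatment: median for numeric gaps, empty string for text gaps.
cleaned = toy.copy()
cleaned["runtime"] = cleaned["runtime"].fillna(cleaned["runtime"].median())
cleaned["tagline"] = cleaned["tagline"].fillna("")

print(missing.to_dict())
print(int(cleaned.isnull().sum().sum()))
```

Dropping the affected rows with `dropna` would be another option when only a handful of values are missing.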
How to deal with outlier data?
print("Number of films that have a budget less than 100 : ", len(df[df['budget'] < 100]))
sns.boxplot(x=df['budget']) # Pass the column by keyword; a positional argument triggers a FutureWarning in seaborn
Number of films that have a budget less than 100 : 1062
<AxesSubplot:xlabel='budget'>
# Budget
Q1 = df.budget.quantile(0.25)
Q3 = df.budget.quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5*IQR
upper_limit = Q3 + 1.5*IQR
lower_limit, upper_limit # (lower limit allowed, upper limit allowed)
(-58050000.0, 98830000.0)
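These limits follow from the quartiles in the summary table: Q1 = 780,000 and Q3 = 40,000,000 give IQR = 39,220,000, so 1.5 * IQR = 58,830,000, and the limits are 780,000 - 58,830,000 = -58,050,000 and 40,000,000 + 58,830,000 = 98,830,000. Since the same rule is applied to several columns, it could be wrapped in a small helper. The function name `iqr_limits` and the sample values below are hypothetical.

```python
import pandas as pd

def iqr_limits(series, k=1.5):
    """Return (lower_limit, upper_limit) using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values: five ordinary numbers and one extreme one.
values = pd.Series([10, 12, 14, 16, 18, 100])
lower, upper = iqr_limits(values)
outliers = values[(values < lower) | (values > upper)]

print(lower, upper)
print(outliers.tolist())
```

A negative lower limit, as with budget, simply means that no value can be flagged as a low outlier by this rule.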
df[df["budget"] >upper_limit]
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | ... | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | movie_id | cast | crew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | ... | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | 19995 | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | 300000000 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | http://disney.go.com/disneypictures/pirates/ | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | ... | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 | 285 | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |
| 2 | 245000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.sonypictures.com/movies/spectre/ | 206647 | [{"id": 470, "name": "spy"}, {"id": 818, "name... | en | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | ... | 148.0 | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 | 206647 | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... |
| 3 | 250000000 | [{"id": 28, "name": "Action"}, {"id": 80, "nam... | http://www.thedarkknightrises.com/ | 49026 | [{"id": 849, "name": "dc comics"}, {"id": 853,... | en | The Dark Knight Rises | Following the death of District Attorney Harve... | 112.312950 | [{"name": "Legendary Pictures", "id": 923}, {"... | ... | 165.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The Legend Ends | The Dark Knight Rises | 7.6 | 9106 | 49026 | [{"cast_id": 2, "character": "Bruce Wayne / Ba... | [{"credit_id": "52fe4781c3a36847f81398c3", "de... |
| 4 | 260000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://movies.disney.com/john-carter | 49529 | [{"id": 818, "name": "based on novel"}, {"id":... | en | John Carter | John Carter is a war-weary, former military ca... | 43.926995 | [{"name": "Walt Disney Pictures", "id": 2}] | ... | 132.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 | 49529 | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 565 | 150000000 | [{"id": 12, "name": "Adventure"}, {"id": 16, "... | http://www.shrek2.com/ | 809 | [{"id": 378, "name": "prison"}, {"id": 2343, "... | en | Shrek 2 | Shrek, Fiona and Donkey set off to Far, Far Aw... | 47.320801 | [{"name": "DreamWorks SKG", "id": 27}, {"name"... | ... | 93.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Once upon another time... | Shrek 2 | 6.7 | 2988 | 809 | [{"cast_id": 29, "character": "Shrek (voice)",... | [{"credit_id": "57d85f1ac3a36878e90025c4", "de... |
| 566 | 120000000 | [{"id": 16, "name": "Animation"}, {"id": 12, "... | http://disney.go.com/disneyvideos/animatedfilm... | 920 | [{"id": 830, "name": "car race"}, {"id": 1926,... | en | Cars | Lightning McQueen, a hotshot rookie race car d... | 82.643036 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | ... | 117.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Ahhh... it's got that new movie smell. | Cars | 6.6 | 3877 | 920 | [{"cast_id": 13, "character": "Lightning McQue... | [{"credit_id": "52fe428dc3a36847f8027841", "de... |
| 692 | 150000000 | [{"id": 16, "name": "Animation"}, {"id": 10751... | http://movies.disney.com/chicken-little | 9982 | [{"id": 1357, "name": "fish"}, {"id": 1415, "n... | en | Chicken Little | When the sky really is falling and sanity has ... | 47.973995 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | ... | 81.0 | [{"iso_639_1": "en", "name": "English"}] | Released | When it comes to saving the world, it helps to... | Chicken Little | 5.6 | 944 | 9982 | [{"cast_id": 19, "character": "Chicken Little ... | [{"credit_id": "52fe4557c3a36847f80c8aff", "de... |
| 1065 | 120000000 | [{"id": 12, "name": "Adventure"}, {"id": 16, "... | http://movies.disney.com/a-bugs-life | 9487 | [{"id": 1442, "name": "winter"}, {"id": 1721, ... | en | A Bug's Life | On behalf of "oppressed bugs everywhere," an i... | 87.350802 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | ... | 95.0 | [{"iso_639_1": "en", "name": "English"}] | Released | An epic presentation of miniature proportions. | A Bug's Life | 6.8 | 2303 | 9487 | [{"cast_id": 1, "character": "Hopper (voice)",... | [{"credit_id": "52fe44fec3a36847f80b64e5", "de... |
| 1658 | 100000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | NaN | 14164 | [{"id": 3436, "name": "karate"}, {"id": 9715, ... | en | Dragonball Evolution | The young warrior Son Goku sets out on a quest... | 21.677732 | [{"name": "Ingenious Film Partners", "id": 289... | ... | 85.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | The legend comes to life. | Dragonball Evolution | 2.9 | 462 | 14164 | [{"cast_id": 17, "character": "Master Roshi", ... | [{"credit_id": "52fe45d29251416c75063b05", "de... |
321 rows × 23 columns
median = df['budget'].median()
df["budget"] = np.where(df["budget"] >upper_limit, median,df['budget'])
df.shape
(4809, 23)
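The shape is unchanged because `np.where` replaces values in place of the outliers rather than dropping rows. The sketch below shows the same replacement on a hypothetical miniature frame; the values and the cutoff of 100 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical budgets; 300 is far above the others.
toy = pd.DataFrame({"budget": [10.0, 20.0, 30.0, 40.0, 300.0]})
upper_limit = 100.0  # assumed cutoff for this toy example

median = toy["budget"].median()  # computed before any value is replaced
toy["budget"] = np.where(toy["budget"] > upper_limit, median, toy["budget"])

print(toy["budget"].tolist())
print(toy.shape)
```

Note that the median is taken from the column as it was before replacement, so the outliers themselves still influence it slightly.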
print("Number of films that have a revenue less than 100 : ", len(df[df['revenue'] < 100]))
sns.boxplot(x=df['revenue']) # Pass the column by keyword; a positional argument triggers a FutureWarning in seaborn
Number of films that have a revenue less than 100 : 1448
<AxesSubplot:xlabel='revenue'>
# Revenue
Q1 = df.revenue.quantile(0.25)
Q3 = df.revenue.quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5*IQR
upper_limit = Q3 + 1.5*IQR
lower_limit, upper_limit
(-139369756.5, 232282927.5)
df[df['revenue']> upper_limit]
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | ... | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | movie_id | cast | crew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | ... | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | 19995 | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | 15000000.0 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | http://disney.go.com/disneypictures/pirates/ | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | ... | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 | 285 | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |
| 2 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.sonypictures.com/movies/spectre/ | 206647 | [{"id": 470, "name": "spy"}, {"id": 818, "name... | en | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | ... | 148.0 | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 | 206647 | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... |
| 3 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 80, "nam... | http://www.thedarkknightrises.com/ | 49026 | [{"id": 849, "name": "dc comics"}, {"id": 853,... | en | The Dark Knight Rises | Following the death of District Attorney Harve... | 112.312950 | [{"name": "Legendary Pictures", "id": 923}, {"... | ... | 165.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The Legend Ends | The Dark Knight Rises | 7.6 | 9106 | 49026 | [{"cast_id": 2, "character": "Bruce Wayne / Ba... | [{"credit_id": "52fe4781c3a36847f81398c3", "de... |
| 4 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://movies.disney.com/john-carter | 49529 | [{"id": 818, "name": "based on novel"}, {"id":... | en | John Carter | John Carter is a war-weary, former military ca... | 43.926995 | [{"name": "Walt Disney Pictures", "id": 2}] | ... | 132.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 | 49529 | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3703 | 6000000.0 | [{"id": 35, "name": "Comedy"}, {"id": 18, "nam... | http://workingtitlefilms.com/film.php?filmID=59 | 712 | [{"id": 213, "name": "upper class"}, {"id": 69... | en | Four Weddings and a Funeral | Four Weddings And A Funeral is a British comed... | 29.834065 | [{"name": "Channel Four Films", "id": 181}, {"... | ... | 117.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Five good reasons to stay single. | Four Weddings and a Funeral | 6.6 | 632 | 712 | [{"cast_id": 7, "character": "Charles", "credi... | [{"credit_id": "52fe426ec3a36847f801e2f3", "de... |
| 3820 | 4000000.0 | [{"id": 18, "name": "Drama"}, {"id": 10749, "n... | NaN | 770 | [{"id": 314, "name": "life and death"}, {"id":... | en | Gone with the Wind | An American classic in which a manipulative wo... | 48.982550 | [{"name": "Selznick International Pictures", "... | ... | 238.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The greatest romance of all time! | Gone with the Wind | 7.7 | 970 | 770 | [{"cast_id": 10, "character": "Scarlett O'Hara... | [{"credit_id": "52fe4274c3a36847f801fe01", "de... |
| 3831 | 3500000.0 | [{"id": 35, "name": "Comedy"}] | NaN | 9427 | [{"id": 1252, "name": "suicide attempt"}, {"id... | en | The Full Monty | Sheffield, England. Gaz, a jobless steelworker... | 17.002623 | [{"name": "Channel Four Films", "id": 181}, {"... | ... | 91.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The year's most revealing comedy. | The Full Monty | 6.8 | 363 | 9427 | [{"cast_id": 1, "character": "Gaz", "credit_id... | [{"credit_id": "52fe44f6c3a36847f80b45e9", "de... |
| 4447 | 858000.0 | [{"id": 16, "name": "Animation"}, {"id": 18, "... | http://movies.disney.com/bambi | 3170 | [{"id": 5774, "name": "forest"}, {"id": 10683,... | en | Bambi | Bambi's tale unfolds from season to season as ... | 47.651878 | [{"name": "Walt Disney Productions", "id": 3166}] | ... | 70.0 | [{"iso_639_1": "en", "name": "English"}] | Released | A great love story. | Bambi | 6.8 | 1405 | 3170 | [{"cast_id": 9, "character": "Young Bambi (voi... | [{"credit_id": "52fe438cc3a36847f805ca73", "de... |
| 4502 | 60000.0 | [{"id": 27, "name": "Horror"}, {"id": 9648, "n... | http://www.blairwitch.com/ | 2667 | [{"id": 616, "name": "witch"}, {"id": 3392, "n... | en | The Blair Witch Project | In October of 1994 three student filmmakers di... | 41.690578 | [{"name": "Artisan Entertainment", "id": 2188}... | ... | 81.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The scariest movie of all time is a true story. | The Blair Witch Project | 6.3 | 1055 | 2667 | [{"cast_id": 41, "character": "Mike", "credit_... | [{"credit_id": "52fe4364c3a36847f8050c01", "de... |
473 rows × 23 columns
median = df['revenue'].median()
df["revenue"] = np.where(df["revenue"] >upper_limit, median,df['revenue'])
df.shape
(4809, 23)
len(df[df['runtime'] == 0])
sns.boxplot(x=df['runtime']) # Pass the column by keyword; a positional argument triggers a FutureWarning in seaborn
<AxesSubplot:xlabel='runtime'>
# Runtime
Q1 = df.runtime.quantile(0.25)
Q3 = df.runtime.quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5*IQR
upper_limit = Q3 + 1.5*IQR
lower_limit, upper_limit
(58.0, 154.0)
df[df['runtime']> upper_limit]
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | ... | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | movie_id | cast | crew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | ... | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | 19995 | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | 15000000.0 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | http://disney.go.com/disneypictures/pirates/ | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | ... | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 | 285 | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |
| 3 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 80, "nam... | http://www.thedarkknightrises.com/ | 49026 | [{"id": 849, "name": "dc comics"}, {"id": 853,... | en | The Dark Knight Rises | Following the death of District Attorney Harve... | 112.312950 | [{"name": "Legendary Pictures", "id": 923}, {"... | ... | 165.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The Legend Ends | The Dark Knight Rises | 7.6 | 9106 | 49026 | [{"cast_id": 2, "character": "Bruce Wayne / Ba... | [{"credit_id": "52fe4781c3a36847f81398c3", "de... |
| 22 | 15000000.0 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | http://www.thehobbit.com/ | 57158 | [{"id": 603, "name": "elves"}, {"id": 604, "na... | en | The Hobbit: The Desolation of Smaug | The Dwarves, Bilbo and Gandalf have successful... | 94.370564 | [{"name": "WingNut Films", "id": 11}, {"name":... | ... | 161.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Beyond darkness... beyond desolation... lies t... | The Hobbit: The Desolation of Smaug | 7.6 | 4524 | 57158 | [{"cast_id": 3, "character": "Bilbo Baggins", ... | [{"credit_id": "52fe4926c3a36847f818b787", "de... |
| 24 | 15000000.0 | [{"id": 12, "name": "Adventure"}, {"id": 18, "... | NaN | 254 | [{"id": 774, "name": "film business"}, {"id": ... | en | King Kong | In 1933 New York, an overly ambitious movie pr... | 61.226010 | [{"name": "WingNut Films", "id": 11}, {"name":... | ... | 187.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The eighth wonder of the world. | King Kong | 6.6 | 2337 | 254 | [{"cast_id": 5, "character": "Ann Darrow", "cr... | [{"credit_id": "52fe422ec3a36847f800a1d7", "de... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4395 | 0.0 | [{"id": 53, "name": "Thriller"}] | NaN | 20296 | [] | en | Chocolate: Deep Dark Secrets | Christmas Eve, London. While the snow-clad cit... | 0.887821 | [] | ... | 200.0 | [{"iso_639_1": "hi", "name": "\u0939\u093f\u09... | Released | NaN | Chocolate: Deep Dark Secrets | 3.4 | 6 | 20296 | [{"cast_id": 1, "character": "Advocate Krishan... | [{"credit_id": "52fe43e0c3a368484e0033f3", "de... |
| 4486 | 700000.0 | [{"id": 99, "name": "Documentary"}] | NaN | 14275 | [{"id": 520, "name": "chicago"}, {"id": 1483, ... | en | Hoop Dreams | This documentary follows two inner-city Chicag... | 9.188431 | [{"name": "Fine Line Features", "id": 8}, {"na... | ... | 171.0 | [{"iso_639_1": "en", "name": "English"}] | Released | An Extraordinary True Story. | Hoop Dreams | 7.7 | 87 | 14275 | [{"cast_id": 1, "character": "Himself", "credi... | [{"credit_id": "52fe45e19251416c75065927", "de... |
| 4503 | 600000.0 | [{"id": 36, "name": "History"}, {"id": 99, "na... | NaN | 9459 | [{"id": 458, "name": "hippie"}, {"id": 460, "n... | en | Woodstock | An intimate look at the Woodstock Music & Art ... | 3.409764 | [{"name": "Wadleigh-Maurice", "id": 3816}, {"n... | ... | 225.0 | [{"iso_639_1": "en", "name": "English"}] | Released | 3 days of peace, music...and love. | Woodstock | 7.1 | 66 | 9459 | [{"cast_id": 10, "character": "Himself", "cred... | [{"credit_id": "52fe44fac3a36847f80b55ef", "de... |
| 4541 | 2000000.0 | [{"id": 28, "name": "Action"}, {"id": 18, "nam... | NaN | 346 | [{"id": 233, "name": "japan"}, {"id": 1462, "n... | ja | 七人の侍 | A samurai answers a village's request for prot... | 39.756748 | [{"name": "Toho Company", "id": 882}] | ... | 207.0 | [{"iso_639_1": "ja", "name": "\u65e5\u672c\u8a... | Released | The Mighty Warriors Who Became the Seven Natio... | Seven Samurai | 8.2 | 878 | 346 | [{"cast_id": 13, "character": "Kikuchiyo", "cr... | [{"credit_id": "52fe423bc3a36847f800dfef", "de... |
| 4598 | 385907.0 | [{"id": 18, "name": "Drama"}] | NaN | 3059 | [{"id": 279, "name": "usa"}, {"id": 2487, "nam... | en | Intolerance | The story of a poor young woman, separated by ... | 3.232447 | [{"name": "Triangle Film Corporation", "id": 1... | ... | 197.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The Cruel Hand of Intolerance | Intolerance | 7.4 | 60 | 3059 | [{"cast_id": 23, "character": "The Woman Who R... | [{"credit_id": "577fe3e79251415db5000bf2", "de... |
140 rows × 23 columns
print(len(df[df['runtime']< lower_limit]))
df[df['runtime']< lower_limit]
42
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | ... | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | movie_id | cast | crew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1014 | 0.0 | [{"id": 27, "name": "Horror"}] | NaN | 53953 | [{"id": 10292, "name": "gore"}, {"id": 12339, ... | de | The Tooth Fairy | A woman and her daughter (Nicole Muñoz) encoun... | 0.716764 | [] | ... | 0.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | NaN | The Tooth Fairy | 4.3 | 13 | 53953 | [{"cast_id": 2, "character": "Peter Campbell",... | [{"credit_id": "52fe4885c3a36847f816b927", "de... |
| 3117 | 0.0 | [{"id": 18, "name": "Drama"}, {"id": 80, "name... | NaN | 41894 | [] | en | Blood Done Sign My Name | A drama based on the true story in which a bla... | 0.397341 | [] | ... | 0.0 | [] | Released | No one changes the world alone. | Blood Done Sign My Name | 6.0 | 5 | 41894 | [{"cast_id": 0, "character": "Boo Tyson", "cre... | [{"credit_id": "58ba3af09251416073014bc1", "de... |
| 3359 | 0.0 | [{"id": 99, "name": "Documentary"}] | NaN | 24977 | [{"id": 6075, "name": "sport"}] | en | Michael Jordan to the Max | This documentary showcases basketball player M... | 1.830306 | [{"name": "IMAX", "id": 3447}] | ... | 46.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Up close some heroes get even bigger. | Michael Jordan to the Max | 7.5 | 10 | 24977 | [{"cast_id": 1, "character": "Himself", "credi... | [{"credit_id": "5786e26ac3a3685338000bdc", "de... |
| 3408 | 0.0 | [{"id": 10751, "name": "Family"}, {"id": 16, "... | NaN | 294512 | [] | en | Alpha and Omega: The Legend of the Saw Tooth Cave | The Alphas and Omegas share a thrilling advent... | 1.874783 | [{"name": "Crest Animation Production", "id": ... | ... | 53.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | Alpha and Omega: The Legend of the Saw Tooth Cave | 6.5 | 4 | 294512 | [{"cast_id": 0, "character": "Runt", "credit_i... | [{"credit_id": "578f33d4c3a3686ecc00f86a", "de... |
| 3476 | 6000000.0 | [{"id": 99, "name": "Documentary"}] | NaN | 57612 | [{"id": 630, "name": "dolphin"}, {"id": 4676, ... | en | Dolphins and Whales: Tribes of the Ocean | This documentary goes to coral reefs of the Ba... | 0.041651 | [{"name": "3D Entertainment", "id": 5313}, {"n... | ... | 42.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | Dolphins and Whales: Tribes of the Ocean | 8.0 | 3 | 57612 | [{"cast_id": 2, "character": "Narrator (voice)... | [{"credit_id": "52fe493dc3a36847f819012d", "de... |
| 3631 | 5000000.0 | [{"id": 99, "name": "Documentary"}] | NaN | 78394 | [{"id": 10506, "name": "prehistoric"}, {"id": ... | en | Sea Rex 3D: Journey to a Prehistoric World | Through the power of IMAX 3D, experience a won... | 4.498368 | [{"name": "N3D Land Productions", "id": 29943}... | ... | 41.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The T-Rex of the Seas come alive. | Sea Rex 3D: Journey to a Prehistoric World | 5.9 | 11 | 78394 | [{"cast_id": 1002, "character": "Conservatory ... | [{"credit_id": "556ca631c3a3685489006251", "de... |
| 3677 | 0.0 | [{"id": 35, "name": "Comedy"}, {"id": 18, "nam... | http://www.romeothemovie.com/ | 113406 | [] | en | Should've Been Romeo | A self-centered, middle-aged pitchman for a po... | 0.407030 | [{"name": "Phillybrook Films", "id": 65147}] | ... | 0.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Even Shakespeare didn't see this one coming. | Should've Been Romeo | 0.0 | 0 | 113406 | [{"cast_id": 4, "character": "Joey", "credit_i... | [{"credit_id": "5617d84d92514166e2001e21", "de... |
| 3816 | 4000000.0 | [{"id": 35, "name": "Comedy"}, {"id": 10749, "... | NaN | 158150 | [] | en | How to Fall in Love | An accountant, who never quite grew out of his... | 1.923514 | [{"name": "Annuit Coeptis Entertainment Inc.",... | ... | 0.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | How to Fall in Love | 5.2 | 20 | 158150 | [{"cast_id": 1, "character": "Annie Hayes", "c... | [{"credit_id": "52fe4bdd9251416c910e82a3", "de... |
| 3960 | 0.0 | [{"id": 10752, "name": "War"}, {"id": 18, "nam... | NaN | 281230 | [{"id": 187056, "name": "woman director"}] | en | Fort McCoy | Unable to serve in World War II because of a h... | 0.384496 | [] | ... | 0.0 | [] | Released | NaN | Fort McCoy | 6.3 | 2 | 281230 | [{"cast_id": 0, "character": "Frank Stirn", "c... | [{"credit_id": "54e269aec3a368454b007976", "de... |
| 3999 | 0.0 | [] | NaN | 346081 | [] | en | Sardaarji | A ghost hunter uses bottles to capture trouble... | 0.296981 | [] | ... | 0.0 | [] | Released | NaN | Sardaarji | 9.5 | 2 | 346081 | [] | [{"credit_id": "558ab3f4925141076f0001d7", "de... |
| 4075 | 0.0 | [] | NaN | 371085 | [] | en | Sharkskin | The Post War II story of Manhattan born Mike E... | 0.027801 | [] | ... | 0.0 | [] | Released | NaN | Sharkskin | 0.0 | 0 | 371085 | [] | [] |
| 4125 | 0.0 | [] | NaN | 325140 | [] | en | Hum To Mohabbat Karega | Raju, a waiter, is in love with the famous TV ... | 0.001186 | [] | ... | 0.0 | [] | Released | NaN | Hum To Mohabbat Karega | 0.0 | 0 | 325140 | [] | [] |
| 4212 | 0.0 | [{"id": 18, "name": "Drama"}, {"id": 80, "name... | http://www.imdb.com/title/tt1289419/ | 66468 | [] | en | N-Secure | N-Secure is a no holds-barred thrilling drama ... | 0.134560 | [] | ... | 0.0 | [] | Released | NaN | N-Secure | 4.3 | 4 | 66468 | [{"cast_id": 3, "character": "David Alan Washi... | [{"credit_id": "52fe473ec3a368484e0bca79", "de... |
| 4217 | 0.0 | [{"id": 10749, "name": "Romance"}] | NaN | 74084 | [] | hi | दिल जो भी कहे | During the British rule in India, several Indi... | 0.122704 | [{"name": "Entertainment One Pvt. Ltd.", "id":... | ... | 0.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | NaN | Dil Jo Bhi Kahey... | 0.0 | 0 | 74084 | [{"cast_id": 2, "character": "Shekhar Sinha", ... | [{"credit_id": "575d52eac3a3683168003910", "de... |
| 4248 | 1500000.0 | [{"id": 35, "name": "Comedy"}] | NaN | 51820 | [{"id": 10183, "name": "independent film"}] | en | The Salon | A Beauty shop owner finds romance as she strug... | 2.028170 | [] | ... | 0.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Where you get more than just a hair cut! | The Salon | 3.5 | 1 | 51820 | [{"cast_id": 2, "character": "Ricky", "credit_... | [{"credit_id": "52fe4805c3a36847f815475f", "de... |
| 4319 | 0.0 | [{"id": 53, "name": "Thriller"}, {"id": 27, "n... | NaN | 107315 | [{"id": 888, "name": "screenwriter"}] | en | Below Zero | When Jack (Edward Furlong) is in danger of mis... | 1.365140 | [] | ... | 0.0 | [{"iso_639_1": "en", "name": "English"}] | Released | There's nothing scarier than a blank page. | Below Zero | 4.4 | 12 | 107315 | [{"cast_id": 1002, "character": "Jack / Frank"... | [{"credit_id": "52fe4a7dc3a36847f81d0e6b", "de... |
| 4324 | 0.0 | [{"id": 27, "name": "Horror"}] | NaN | 310933 | [] | en | Bleeding Hearts | Captured Hearts, an insane serial killer/horro... | 0.100533 | [] | ... | 0.0 | [] | Released | NaN | Bleeding Hearts | 2.0 | 1 | 310933 | [{"cast_id": 1, "character": "Sheriff Wilson",... | [{"credit_id": "58645a32c3a36852ba0151a9", "de... |
| 4328 | 0.0 | [{"id": 99, "name": "Documentary"}] | NaN | 102840 | [] | en | Sex With Strangers | For some married couples, sex is an obsession ... | 0.014406 | [] | ... | 0.0 | [] | Released | NaN | Sex With Strangers | 5.0 | 1 | 102840 | [] | [] |
| 4334 | 0.0 | [{"id": 27, "name": "Horror"}, {"id": 99, "nam... | NaN | 202604 | [{"id": 2626, "name": "exorcism"}] | en | The Vatican Exorcisms | Documentary following US film-maker Joe Marino... | 0.447166 | [] | ... | 0.0 | [{"iso_639_1": "it", "name": "Italiano"}, {"is... | Released | The public were never meant to know | The Vatican Exorcisms | 4.4 | 11 | 202604 | [{"cast_id": 2, "character": "Himself", "credi... | [{"credit_id": "52fe4cccc3a368484e1c7333", "de... |
| 4411 | 0.0 | [{"id": 10751, "name": "Family"}, {"id": 35, "... | https://www.epicbuzz.net/movies/karachi-se-lahore | 357441 | [] | en | Karachi se Lahore | A road trip from Karachi to Lahore where 5 fri... | 0.060003 | [] | ... | 0.0 | [{"iso_639_1": "ur", "name": "\u0627\u0631\u06... | Released | NaN | Karachi se Lahore | 8.0 | 1 | 357441 | [{"cast_id": 0, "character": "", "credit_id": ... | [] |
| 4441 | 0.0 | [{"id": 27, "name": "Horror"}] | NaN | 323270 | [{"id": 9706, "name": "anthology"}] | en | The Horror Network Vol. 1 | Serial killers, ghostly phone calls, inner dem... | 0.392658 | [] | ... | 0.0 | [] | Released | NaN | The Horror Network Vol. 1 | 5.0 | 2 | 323270 | [{"cast_id": 0, "character": "Hal", "credit_id... | [{"credit_id": "5798c7b29251411838003dd6", "de... |
| 4464 | 0.0 | [] | NaN | 279759 | [] | en | Harrison Montgomery | Film from Daniel Davila | 0.006943 | [] | ... | 0.0 | [] | Released | NaN | Harrison Montgomery | 0.0 | 0 | 279759 | [] | [] |
| 4472 | 0.0 | [{"id": 27, "name": "Horror"}, {"id": 878, "na... | NaN | 211557 | [] | en | Vessel | Vessel is the story of the passengers of Fligh... | 0.322553 | [{"name": "Baker's Dozen Productions", "id": 5... | ... | 14.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | Vessel | 5.9 | 8 | 211557 | [{"cast_id": 1, "character": "Danny", "credit_... | [{"credit_id": "52fe4d97c3a368484e1f1ad7", "de... |
| 4508 | 0.0 | [{"id": 80, "name": "Crime"}, {"id": 18, "name... | NaN | 263503 | [] | en | Water & Power | Twin brothers nicknamed "Water" and "Power" fr... | 0.350557 | [] | ... | 0.0 | [] | Released | NaN | Water & Power | 3.0 | 1 | 263503 | [{"cast_id": 2, "character": "Water", "credit_... | [{"credit_id": "5342e38cc3a368151e003aa9", "de... |
| 4510 | 0.0 | [] | NaN | 331493 | [] | en | Light from the Darkroom | Light in the Darkroom is the story of two best... | 0.012942 | [] | ... | 0.0 | [] | Released | NaN | Light from the Darkroom | 0.0 | 0 | 331493 | [] | [] |
| 4559 | 0.0 | [] | NaN | 380097 | [] | en | America Is Still the Place | 1971 post civil rights San Francisco seemed li... | 0.000000 | [] | ... | 0.0 | [] | Released | NaN | America Is Still the Place | 0.0 | 0 | 380097 | [] | [] |
| 4564 | 0.0 | [{"id": 10402, "name": "Music"}, {"id": 27, "n... | http://www.thedevilscarnival.com/ | 285743 | [{"id": 3473, "name": "carnival"}, {"id": 4344... | en | Alleluia! The Devil's Carnival | The Devil's Carnival: Alleluia! is the second ... | 0.674398 | [{"name": "Limb from Limb Pictures", "id": 590... | ... | 0.0 | [] | Released | Hell ain't got a prayer. | Alleluia! The Devil's Carnival | 6.0 | 2 | 285743 | [{"cast_id": 11, "character": "Lucifer", "cred... | [{"credit_id": "53e40a4f0e0a262b8f0050c4", "de... |
| 4570 | 0.0 | [{"id": 18, "name": "Drama"}] | NaN | 94072 | [] | en | Straight Out of Brooklyn | A Special Jury Award winner at the Sundance Fi... | 0.161517 | [] | ... | 0.0 | [] | Released | NaN | Straight Out of Brooklyn | 4.3 | 4 | 94072 | [] | [{"credit_id": "52fe49499251416c750c3167", "de... |
| 4572 | 0.0 | [] | NaN | 325579 | [] | en | Diamond Ruff | Action - Orphan, con artist, crime boss and mi... | 0.165257 | [] | ... | 0.0 | [] | Released | NaN | Diamond Ruff | 2.4 | 4 | 325579 | [] | [] |
| 4575 | 0.0 | [] | http://mutualfriendsmovie.com/ | 198370 | [] | en | Mutual Friends | Surprise parties rarely go well. This one is n... | 0.136721 | [] | ... | 0.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Surprise parties rarely go well. | Mutual Friends | 0.0 | 0 | 198370 | [{"cast_id": 3, "character": "Liv", "credit_id... | [{"credit_id": "52fe4d7b9251416c9111797f", "de... |
| 4577 | 0.0 | [] | NaN | 328307 | [] | en | Rise of the Entrepreneur: The Search for a Bet... | The world is changing faster than ever. Techno... | 0.052942 | [] | ... | 0.0 | [] | Released | NaN | Rise of the Entrepreneur: The Search for a Bet... | 8.0 | 1 | 328307 | [] | [] |
| 4587 | 0.0 | [] | NaN | 281189 | [{"id": 187056, "name": "woman director"}] | en | Gory Gory Hallelujah | Four actors compete for the role of Jesus - a ... | 0.033883 | [] | ... | 0.0 | [] | Released | NaN | Gory Gory Hallelujah | 1.0 | 1 | 281189 | [] | [{"credit_id": "55135b3f9251412be400021e", "de... |
| 4590 | 0.0 | [{"id": 27, "name": "Horror"}, {"id": 35, "nam... | NaN | 189711 | [] | en | Love in the Time of Monsters | Two sisters travel to a cheesy tourist trap wh... | 0.133619 | [{"name": "Red Cube Picture", "id": 27892}, {"... | ... | 0.0 | [] | Released | NaN | Love in the Time of Monsters | 5.0 | 2 | 189711 | [{"cast_id": 1, "character": "Lou", "credit_id... | [{"credit_id": "52fe4d609251416c751392bb", "de... |
| 4617 | 0.0 | [] | NaN | 162396 | [] | en | The Big Swap | In this British drama, Ellen (Sorcha Brooks) a... | 0.627763 | [] | ... | 0.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | The Big Swap | 0.0 | 0 | 162396 | [{"cast_id": 2, "character": "Sam", "credit_id... | [{"credit_id": "52fe4c59c3a36847f8229a57", "de... |
| 4626 | 0.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | NaN | 47534 | [{"id": 2792, "name": "boxer"}, {"id": 4076, "... | en | Fighting Tommy Riley | An aging trainer and a young fighter, both in ... | 0.045429 | [] | ... | 0.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | Fighting Tommy Riley | 5.3 | 4 | 47534 | [{"cast_id": 3, "character": "", "credit_id": ... | [{"credit_id": "52fe4737c3a36847f8129b6b", "de... |
| 4639 | 0.0 | [] | NaN | 300327 | [] | en | Death Calls | An action-packed love story on the Mexican bor... | 0.005883 | [] | ... | 0.0 | [] | Released | NaN | Death Calls | 0.0 | 0 | 300327 | [] | [] |
| 4663 | 0.0 | [] | NaN | 320435 | [] | en | UnDivided | UnDivided documents the true story of how a su... | 0.010607 | [] | ... | 0.0 | [] | Released | NaN | UnDivided | 0.0 | 0 | 320435 | [] | [] |
| 4664 | 0.0 | [{"id": 27, "name": "Horror"}, {"id": 53, "nam... | NaN | 150211 | [{"id": 177972, "name": "bickering"}, {"id": 2... | en | The Frozen | After a harrowing snowmobile accident, a young... | 1.084387 | [] | ... | 0.0 | [] | Released | NaN | The Frozen | 4.2 | 14 | 150211 | [{"cast_id": 1, "character": "The Hunter", "cr... | [{"credit_id": "58e7ed2bc3a3684aa4045117", "de... |
| 4668 | 0.0 | [{"id": 35, "name": "Comedy"}] | NaN | 40963 | [{"id": 10183, "name": "independent film"}] | en | Little Big Top | An aging out of work clown returns to his smal... | 0.092100 | [{"name": "Fly High Films", "id": 24248}] | ... | 0.0 | [{"iso_639_1": "en", "name": "English"}] | Rumored | NaN | Little Big Top | 10.0 | 1 | 40963 | [{"cast_id": 0, "character": "Seymour", "credi... | [] |
| 4715 | 0.0 | [{"id": 16, "name": "Animation"}, {"id": 10751... | NaN | 13187 | [{"id": 65, "name": "holiday"}, {"id": 207317,... | en | A Charlie Brown Christmas | When Charlie Brown complains about the overwhe... | 8.701183 | [{"name": "Warner Bros. Home Video", "id": 5173}] | ... | 25.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | That's what Christmas is all about, Charlie Br... | A Charlie Brown Christmas | 7.5 | 153 | 13187 | [{"cast_id": 2, "character": "Freida (voice)",... | [{"credit_id": "52fe454b9251416c75051a75", "de... |
| 4735 | 0.0 | [{"id": 10751, "name": "Family"}] | NaN | 272726 | [] | en | Dude Where's My Dog? | Left home alone with his dog Harry, young Ray ... | 0.283970 | [] | ... | 0.0 | [] | Released | NaN | Dude Where's My Dog? | 0.0 | 0 | 272726 | [{"cast_id": 6, "character": "", "credit_id": ... | [{"credit_id": "592bed26c3a36877bc07a7e0", "de... |
| 4762 | 50000.0 | [{"id": 27, "name": "Horror"}, {"id": 53, "nam... | http://www.cthulhulives.org/cocmovie/index.html | 20981 | [{"id": 1523, "name": "obsession"}, {"id": 303... | en | The Call of Cthulhu | A dying professor leaves his great-nephew a co... | 1.777148 | [{"name": "HPLHS", "id": 17827}] | ... | 47.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | The Call of Cthulhu | 6.9 | 41 | 20981 | [{"cast_id": 4, "character": "The Man", "credi... | [{"credit_id": "52fe4407c3a368484e00b4a5", "de... |
42 rows × 23 columns
df = df[df['runtime']> lower_limit]
df = df[df['runtime']< upper_limit]
df.shape
(4613, 23)
Null values:¶df.isna().sum() # Returns the number of missing values in each column.
budget 0 genres 0 homepage 2950 id 0 keywords 0 original_language 0 original_title 0 overview 1 popularity 0 production_companies 0 production_countries 0 release_date 0 revenue 0 runtime 0 spoken_languages 0 status 0 tagline 782 title 0 vote_average 0 vote_count 0 movie_id 0 cast 0 crew 0 dtype: int64
df_credits.isna().sum() # There are no null values in this data frame
movie_id 0 title 0 cast 0 crew 0 dtype: int64
Drop some columns and rows with null values :¶df.drop('homepage', inplace=True, axis=1) # Drop the homepage column (it is null for most rows)
df.drop(df[(df['runtime'] == 0)].index, inplace = True) # Drop rows with runtime == 0, since no film has zero duration
df["tagline"] = df["tagline"].fillna("") # Fill missing taglines with an empty string (the column is kept)
df = df.dropna() # Drop the remaining rows with null values (only overview still has one)
df.isna().sum() # Now there are no null values left
budget 0 genres 0 id 0 keywords 0 original_language 0 original_title 0 overview 0 popularity 0 production_companies 0 production_countries 0 release_date 0 revenue 0 runtime 0 spoken_languages 0 status 0 tagline 0 title 0 vote_average 0 vote_count 0 movie_id 0 cast 0 crew 0 dtype: int64
df.shape
(4612, 22)
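The difference between filling and dropping is easy to see on a toy frame. This is a minimal sketch with invented rows; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "tagline": ["Catchy line", np.nan, np.nan],
    "overview": ["Plot A", "Plot B", np.nan],
})

# Fill missing taglines with an empty string so those rows survive dropna().
toy["tagline"] = toy["tagline"].fillna("")

# Drop rows that still contain a missing value (here: the missing overview).
cleaned = toy.dropna()

print(cleaned["title"].tolist())
```

Filling first keeps movies whose only gap is an optional field, while `dropna()` removes rows that lack something the recommender needs, such as the overview.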
df.head()
| budget | genres | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | ... | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | movie_id | cast | crew | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 206647 | [{"id": 470, "name": "spy"}, {"id": 818, "name... | en | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | [{"iso_3166_1": "GB", "name": "United Kingdom"... | ... | 148.0 | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 | 206647 | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... |
| 4 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 49529 | [{"id": 818, "name": "based on novel"}, {"id":... | en | John Carter | John Carter is a war-weary, former military ca... | 43.926995 | [{"name": "Walt Disney Pictures", "id": 2}] | [{"iso_3166_1": "US", "name": "United States o... | ... | 132.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 | 49529 | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... |
| 5 | 15000000.0 | [{"id": 14, "name": "Fantasy"}, {"id": 28, "na... | 559 | [{"id": 851, "name": "dual identity"}, {"id": ... | en | Spider-Man 3 | The seemingly invincible Spider-Man goes up ag... | 115.699814 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | [{"iso_3166_1": "US", "name": "United States o... | ... | 139.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | The battle within. | Spider-Man 3 | 5.9 | 3576 | 559 | [{"cast_id": 30, "character": "Peter Parker / ... | [{"credit_id": "52fe4252c3a36847f80151a5", "de... |
| 6 | 15000000.0 | [{"id": 16, "name": "Animation"}, {"id": 10751... | 38757 | [{"id": 1562, "name": "hostage"}, {"id": 2343,... | en | Tangled | When the kingdom's most wanted-and most charmi... | 48.681969 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | [{"iso_3166_1": "US", "name": "United States o... | ... | 100.0 | [{"iso_639_1": "en", "name": "English"}] | Released | They're taking adventure to new lengths. | Tangled | 7.4 | 3330 | 38757 | [{"cast_id": 34, "character": "Flynn Rider (vo... | [{"credit_id": "52fe46db9251416c91062101", "de... |
| 7 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 99861 | [{"id": 8828, "name": "marvel comic"}, {"id": ... | en | Avengers: Age of Ultron | When Tony Stark tries to jumpstart a dormant p... | 134.279229 | [{"name": "Marvel Studios", "id": 420}, {"name... | [{"iso_3166_1": "US", "name": "United States o... | ... | 141.0 | [{"iso_639_1": "en", "name": "English"}] | Released | A New Age Has Come. | Avengers: Age of Ultron | 7.3 | 6767 | 99861 | [{"cast_id": 76, "character": "Tony Stark / Ir... | [{"credit_id": "55d5f7d4c3a3683e7e0016eb", "de... |
5 rows × 22 columns
Correlation Matrix :¶# Let's make our correlation matrix a little prettier
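corr_matrix = df.corr(numeric_only=True) # Correlate numeric columns only; recent pandas versions raise on text columns otherwise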
fig, ax = plt.subplots(figsize=(15, 15))
ax = sns.heatmap(corr_matrix,
annot=True,
linewidths=0.9,
fmt=".2f",
cmap="YlGnBu");
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
(8.5, -0.5)
Convert release_date to a year so the column is easier to work with :¶df['release_year'] = pd.to_datetime(df['release_date']).dt.year # Add a new column with the year of the film
df.drop('release_date', inplace=True, axis=1) # Drop the column of release_date because we replace it with the year column
df.head()
| budget | genres | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | ... | spoken_languages | status | tagline | title | vote_average | vote_count | movie_id | cast | crew | release_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 206647 | [{"id": 470, "name": "spy"}, {"id": 818, "name... | en | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | [{"iso_3166_1": "GB", "name": "United Kingdom"... | ... | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 | 206647 | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... | 2015 |
| 4 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 49529 | [{"id": 818, "name": "based on novel"}, {"id":... | en | John Carter | John Carter is a war-weary, former military ca... | 43.926995 | [{"name": "Walt Disney Pictures", "id": 2}] | [{"iso_3166_1": "US", "name": "United States o... | ... | [{"iso_639_1": "en", "name": "English"}] | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 | 49529 | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... | 2012 |
| 5 | 15000000.0 | [{"id": 14, "name": "Fantasy"}, {"id": 28, "na... | 559 | [{"id": 851, "name": "dual identity"}, {"id": ... | en | Spider-Man 3 | The seemingly invincible Spider-Man goes up ag... | 115.699814 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | [{"iso_3166_1": "US", "name": "United States o... | ... | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | The battle within. | Spider-Man 3 | 5.9 | 3576 | 559 | [{"cast_id": 30, "character": "Peter Parker / ... | [{"credit_id": "52fe4252c3a36847f80151a5", "de... | 2007 |
| 6 | 15000000.0 | [{"id": 16, "name": "Animation"}, {"id": 10751... | 38757 | [{"id": 1562, "name": "hostage"}, {"id": 2343,... | en | Tangled | When the kingdom's most wanted-and most charmi... | 48.681969 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | [{"iso_3166_1": "US", "name": "United States o... | ... | [{"iso_639_1": "en", "name": "English"}] | Released | They're taking adventure to new lengths. | Tangled | 7.4 | 3330 | 38757 | [{"cast_id": 34, "character": "Flynn Rider (vo... | [{"credit_id": "52fe46db9251416c91062101", "de... | 2010 |
| 7 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 99861 | [{"id": 8828, "name": "marvel comic"}, {"id": ... | en | Avengers: Age of Ultron | When Tony Stark tries to jumpstart a dormant p... | 134.279229 | [{"name": "Marvel Studios", "id": 420}, {"name... | [{"iso_3166_1": "US", "name": "United States o... | ... | [{"iso_639_1": "en", "name": "English"}] | Released | A New Age Has Come. | Avengers: Age of Ultron | 7.3 | 6767 | 99861 | [{"cast_id": 76, "character": "Tony Stark / Ir... | [{"credit_id": "55d5f7d4c3a3683e7e0016eb", "de... | 2015 |
5 rows × 22 columns
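If some dates could not be parsed, `pd.to_datetime` would raise by default. A hedged sketch of a more tolerant variant, using invented date strings, is:

```python
import pandas as pd

# Invented sample values; the last one is deliberately malformed.
dates = pd.Series(["2015-10-26", "2012-03-07", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
years = pd.to_datetime(dates, errors="coerce").dt.year

print(years.tolist())
```

Rows where the year comes back missing can then be inspected or dropped explicitly. In this notebook the nulls were removed beforehand, so the plain call is sufficient.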
Make a new column for runtime types :¶0 => duration <= 40 => Short Movies
1 => 40 < duration <= 75 => Medium Duration Movies
2 => duration > 75 => Long Duration Movies
duration_genres = np.array([])
for i in df['runtime']:
    if i <= 40:
        duration_genres = np.append(duration_genres, 0)  # short
    elif i <= 75:
        duration_genres = np.append(duration_genres, 1)  # medium
    else:
        duration_genres = np.append(duration_genres, 2)  # long
duration_genres
array([2., 2., 2., ..., 2., 2., 2.])
df['duration_genres'] = duration_genres
df.head()
| budget | genres | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | ... | status | tagline | title | vote_average | vote_count | movie_id | cast | crew | release_year | duration_genres | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 206647 | [{"id": 470, "name": "spy"}, {"id": 818, "name... | en | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | [{"iso_3166_1": "GB", "name": "United Kingdom"... | ... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 | 206647 | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... | 2015 | 2.0 |
| 4 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 49529 | [{"id": 818, "name": "based on novel"}, {"id":... | en | John Carter | John Carter is a war-weary, former military ca... | 43.926995 | [{"name": "Walt Disney Pictures", "id": 2}] | [{"iso_3166_1": "US", "name": "United States o... | ... | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 | 49529 | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... | 2012 | 2.0 |
| 5 | 15000000.0 | [{"id": 14, "name": "Fantasy"}, {"id": 28, "na... | 559 | [{"id": 851, "name": "dual identity"}, {"id": ... | en | Spider-Man 3 | The seemingly invincible Spider-Man goes up ag... | 115.699814 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | [{"iso_3166_1": "US", "name": "United States o... | ... | Released | The battle within. | Spider-Man 3 | 5.9 | 3576 | 559 | [{"cast_id": 30, "character": "Peter Parker / ... | [{"credit_id": "52fe4252c3a36847f80151a5", "de... | 2007 | 2.0 |
| 6 | 15000000.0 | [{"id": 16, "name": "Animation"}, {"id": 10751... | 38757 | [{"id": 1562, "name": "hostage"}, {"id": 2343,... | en | Tangled | When the kingdom's most wanted-and most charmi... | 48.681969 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | [{"iso_3166_1": "US", "name": "United States o... | ... | Released | They're taking adventure to new lengths. | Tangled | 7.4 | 3330 | 38757 | [{"cast_id": 34, "character": "Flynn Rider (vo... | [{"credit_id": "52fe46db9251416c91062101", "de... | 2010 | 2.0 |
| 7 | 15000000.0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 99861 | [{"id": 8828, "name": "marvel comic"}, {"id": ... | en | Avengers: Age of Ultron | When Tony Stark tries to jumpstart a dormant p... | 134.279229 | [{"name": "Marvel Studios", "id": 420}, {"name... | [{"iso_3166_1": "US", "name": "United States o... | ... | Released | A New Age Has Come. | Avengers: Age of Ultron | 7.3 | 6767 | 99861 | [{"cast_id": 76, "character": "Tony Stark / Ir... | [{"credit_id": "55d5f7d4c3a3683e7e0016eb", "de... | 2015 | 2.0 |
5 rows × 23 columns
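The same binning can be done without a Python loop. The sketch below uses `np.digitize` with the thresholds 40 and 75 from the loop's code; the runtimes are invented, and the result is an integer array rather than the float array shown above.

```python
import numpy as np

# Invented runtimes chosen to sit on and around the two thresholds.
runtimes = np.array([25.0, 40.0, 41.0, 75.0, 76.0, 148.0])

# With right=True each bin includes its upper edge:
#   x <= 40 -> 0, 40 < x <= 75 -> 1, x > 75 -> 2
codes = np.digitize(runtimes, bins=[40, 75], right=True)

print(codes.tolist())
```

Besides being shorter, this avoids calling `np.append` once per row, which copies the whole array each time.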
Genre Extraction function : from raw data for the creation of tags :¶def convert(obj):
    L = []
    for i in ast.literal_eval(obj):  # Parse the stringified list of dicts
        L.append(i['name'])          # Keep only the name field
    return L
# First data frame :
df['genres'] = df['genres'].apply(convert)
df['keywords'] = df['keywords'].apply(convert)
df['spoken_languages'] = df['spoken_languages'].apply(convert)
df['production_countries'] = df['production_countries'].apply(convert)
df['production_companies'] = df['production_companies'].apply(convert)
#--------------------------------------------------------------------------------------------
df.head()
| budget | genres | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | ... | status | tagline | title | vote_average | vote_count | movie_id | cast | crew | release_year | duration_genres | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 15000000.0 | [Action, Adventure, Crime] | 206647 | [spy, based on novel, secret agent, sequel, mi... | en | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 | [Columbia Pictures, Danjaq, B24] | [United Kingdom, United States of America] | ... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 | 206647 | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... | 2015 | 2.0 |
| 4 | 15000000.0 | [Action, Adventure, Science Fiction] | 49529 | [based on novel, mars, medallion, space travel... | en | John Carter | John Carter is a war-weary, former military ca... | 43.926995 | [Walt Disney Pictures] | [United States of America] | ... | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 | 49529 | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... | 2012 | 2.0 |
| 5 | 15000000.0 | [Fantasy, Action, Adventure] | 559 | [dual identity, amnesia, sandstorm, love of on... | en | Spider-Man 3 | The seemingly invincible Spider-Man goes up ag... | 115.699814 | [Columbia Pictures, Laura Ziskin Productions, ... | [United States of America] | ... | Released | The battle within. | Spider-Man 3 | 5.9 | 3576 | 559 | [{"cast_id": 30, "character": "Peter Parker / ... | [{"credit_id": "52fe4252c3a36847f80151a5", "de... | 2007 | 2.0 |
| 6 | 15000000.0 | [Animation, Family] | 38757 | [hostage, magic, horse, fairy tale, musical, p... | en | Tangled | When the kingdom's most wanted-and most charmi... | 48.681969 | [Walt Disney Pictures, Walt Disney Animation S... | [United States of America] | ... | Released | They're taking adventure to new lengths. | Tangled | 7.4 | 3330 | 38757 | [{"cast_id": 34, "character": "Flynn Rider (vo... | [{"credit_id": "52fe46db9251416c91062101", "de... | 2010 | 2.0 |
| 7 | 15000000.0 | [Action, Adventure, Science Fiction] | 99861 | [marvel comic, sequel, superhero, based on com... | en | Avengers: Age of Ultron | When Tony Stark tries to jumpstart a dormant p... | 134.279229 | [Marvel Studios, Prime Focus, Revolution Sun S... | [United States of America] | ... | Released | A New Age Has Come. | Avengers: Age of Ultron | 7.3 | 6767 | 99861 | [{"cast_id": 76, "character": "Tony Stark / Ir... | [{"credit_id": "55d5f7d4c3a3683e7e0016eb", "de... | 2015 | 2.0 |
5 rows × 23 columns
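To see what the extraction does to a single cell, here is a small self-contained sketch. The helper name `names_from_json_list` is hypothetical; the sample string follows the shape of the `genres` column shown above.

```python
import ast

def names_from_json_list(raw):
    """Return the 'name' field of every dict in a stringified list."""
    return [item["name"] for item in ast.literal_eval(raw)]

# Sample in the same shape as a genres cell.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

print(names_from_json_list(sample))
print(names_from_json_list("[]"))  # an empty cell yields an empty list
```

`ast.literal_eval` only evaluates Python literals, so it is a safe way to parse these cells, unlike `eval`.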
Function for extracting the first 8 actors from the movie :¶def convert_actors(obj):
    L = []
    counter = 0
    for i in ast.literal_eval(obj):
        if counter != 8:
            L.append(i['name'])
            counter += 1
        else:
            break
    return L
df['cast'] = df['cast'].apply(convert_actors)
Function to fetch the director of the movie from the crew column :¶def fetch_director(obj):
    L = []
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            L.append(i['name'])
            break  # Keep only the first director listed
    return L
df['crew'] = df['crew'].apply(fetch_director)
df['overview'] = df['overview'].apply(lambda x:x.split())
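Both helpers can be written more compactly with slicing and an early return. This is only a sketch: the function names and the sample strings are invented for illustration, and the behaviour is intended to match the versions above.

```python
import ast

def first_names(raw, limit=8):
    """Names of the first `limit` entries in a stringified list of dicts."""
    return [member["name"] for member in ast.literal_eval(raw)[:limit]]

def director_names(raw):
    """One-element list with the first director, or an empty list."""
    for member in ast.literal_eval(raw):
        if member.get("job") == "Director":
            return [member["name"]]
    return []

# Invented sample cells in the same shape as the cast and crew columns.
cast_raw = '[{"name": "Actor One"}, {"name": "Actor Two"}, {"name": "Actor Three"}]'
crew_raw = '[{"job": "Producer", "name": "P"}, {"job": "Director", "name": "D"}]'

print(first_names(cast_raw, limit=2))
print(director_names(crew_raw))
```

Returning a list even for a single director keeps the column type consistent with the other list columns, which matters when the lists are concatenated later.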
Remove spaces between words :¶df['genres'] = df['genres'].apply(lambda x:[i.replace(" ","") for i in x])
df['keywords'] = df['keywords'].apply(lambda x:[i.replace(" ","") for i in x])
df['cast'] = df['cast'].apply(lambda x:[i.replace(" ","") for i in x])
df['crew'] = df['crew'].apply(lambda x:[i.replace(" ","") for i in x])
df['features'] = df['overview'] + df['genres'] + df['keywords'] + df['cast'] + df['crew']
df['features'] = df['features'].apply(lambda x:" ".join(x))
Lowercase all letters in the features column :¶df['features'] = df['features'].apply(lambda x:x.lower())
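The feature-building steps can be traced for a single movie. In the sketch below the function name `build_features` and the inputs are made up; it follows the same order of operations: split the overview, remove spaces inside names, concatenate, join, lowercase.

```python
def build_features(overview, genres, keywords, cast, crew):
    """Combine one movie's fields into a single lowercase feature string."""
    tokens = overview.split()
    for names in (genres, keywords, cast, crew):
        # "Sam Mendes" becomes "SamMendes" so the full name is one token.
        tokens += [name.replace(" ", "") for name in names]
    return " ".join(tokens).lower()

text = build_features(
    "A spy story",
    ["Science Fiction"],
    ["secret agent"],
    ["Daniel Craig"],
    ["Sam Mendes"],
)
print(text)
```

Removing the spaces prevents two different people who share a first name from contributing the same token to the vector.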
Apply stemming to reduce inflected words to a common root :¶import nltk
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def xStem(txt):
    y = []
    for x in txt.split():
        y.append(ps.stem(x))  # Stem each word individually
    return " ".join(y)
df['features'] = df['features'].apply(xStem)
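Convert text to matrix :¶from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity
hv = HashingVectorizer(stop_words="english", n_features=7000) # Hash each word into one of 7000 columns
hv_vector = hv.fit_transform(df['features']).toarray()
similarity = cosine_similarity(hv_vector) # Pairwise cosine similarity between all movies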
pd.DataFrame(similarity,index=df['title'],columns=df['title'])
| title | Spectre | John Carter | Spider-Man 3 | Tangled | Avengers: Age of Ultron | Harry Potter and the Half-Blood Prince | Batman v Superman: Dawn of Justice | Quantum of Solace | Pirates of the Caribbean: Dead Man's Chest | The Lone Ranger | ... | On The Downlow | Sanctuary: Quite a Conundrum | Bang | Primer | Cavite | El Mariachi | Newlyweds | Signed, Sealed, Delivered | Shanghai Calling | My Date with Drew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| title | |||||||||||||||||||||
| Spectre | 1.000000 | 0.055670 | 0.054447 | 0.015386 | 0.095730 | 0.046524 | 0.057438 | 0.265908 | 0.045980 | 0.017293 | ... | 0.049568 | 0.000000 | 0.000000 | 0.000000 | 0.013937 | 0.032141 | -0.036564 | 0.018282 | 0.019627 | 0.000000 |
| John Carter | 0.055670 | 1.000000 | 0.074092 | 0.037689 | 0.140694 | 0.056980 | 0.062531 | 0.108556 | 0.112628 | 0.028239 | ... | 0.020236 | 0.000000 | 0.053722 | 0.016449 | 0.113798 | 0.118094 | 0.000000 | 0.014927 | 0.048075 | 0.000000 |
| Spider-Man 3 | 0.054447 | 0.074092 | 1.000000 | 0.024574 | 0.122312 | 0.055728 | 0.122312 | 0.091003 | 0.073435 | 0.013809 | ... | 0.000000 | 0.000000 | 0.078811 | 0.032174 | 0.044519 | 0.064166 | 0.000000 | 0.014599 | 0.047019 | 0.031083 |
| Tangled | 0.015386 | 0.037689 | 0.024574 | 1.000000 | 0.038886 | 0.031497 | 0.051848 | -0.012859 | 0.000000 | 0.058537 | ... | 0.000000 | 0.000000 | 0.044544 | 0.013639 | 0.000000 | 0.021760 | 0.000000 | 0.012377 | 0.039862 | 0.000000 |
| Avengers: Age of Ultron | 0.095730 | 0.140694 | 0.122312 | 0.038886 | 1.000000 | 0.039193 | 0.145161 | 0.080003 | 0.096837 | 0.043704 | ... | 0.000000 | 0.000000 | 0.027714 | 0.050913 | 0.058706 | 0.108306 | 0.000000 | 0.000000 | 0.049602 | 0.065583 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| El Mariachi | 0.032141 | 0.118094 | 0.064166 | 0.021760 | 0.108306 | 0.098693 | 0.027077 | 0.161165 | 0.097538 | 0.036684 | ... | 0.000000 | 0.038292 | 0.104679 | 0.042735 | 0.226670 | 1.000000 | 0.000000 | 0.000000 | 0.124904 | 0.096334 |
| Newlyweds | -0.036564 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.043561 | 0.000000 | 0.000000 | 0.022422 | 0.000000 | 1.000000 | 0.088235 | 0.000000 | 0.000000 |
| Signed, Sealed, Delivered | 0.018282 | 0.014927 | 0.014599 | 0.012377 | 0.000000 | 0.037424 | 0.015401 | 0.000000 | -0.018493 | 0.041731 | ... | 0.039873 | 0.021780 | 0.013231 | 0.032410 | 0.022422 | 0.000000 | 0.088235 | 1.000000 | 0.015788 | 0.046967 |
| Shanghai Calling | 0.019627 | 0.048075 | 0.047019 | 0.039862 | 0.049602 | 0.080354 | 0.016534 | 0.065609 | 0.039707 | 0.000000 | ... | 0.000000 | 0.023383 | 0.071024 | 0.000000 | 0.120360 | 0.124904 | 0.000000 | 0.015788 | 1.000000 | 0.067229 |
| My Date with Drew | 0.000000 | 0.000000 | 0.031083 | 0.000000 | 0.065583 | 0.079682 | 0.016396 | 0.048795 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.028172 | 0.017252 | 0.059676 | 0.096334 | 0.000000 | 0.046967 | 0.067229 | 1.000000 |
4612 rows × 4612 columns
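The matrix above holds the cosine similarity between every pair of feature vectors. As a minimal sketch of what each entry means, the same quantity can be computed by hand. The three vectors below are made up for illustration and are not rows of `hv_vector`; `cosine_sim` is a hypothetical helper.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# hypothetical feature vectors for three movies (illustrative values only)
toy_vectors = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 2],
]

# pairwise matrix, analogous in shape to cosine_similarity(hv_vector)
toy_similarity = [[cosine_sim(u, v) for v in toy_vectors] for u in toy_vectors]
for row in toy_similarity:
    print([round(x, 3) for x in row])
```

The diagonal is 1 because every movie is identical to itself, and the matrix is symmetric, which matches the layout of the table above.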
distances = similarity[32]  # similarities between the movie at index 32 and every other movie
sorted(distances,reverse=True)[0:10]
[1.0000000000000002, 0.3360537729058417, 0.2832856927186045, 0.2783349703706405, 0.26122949691608693, 0.2545454545454546, 0.24003840921845832, 0.23994948963429277, 0.2397457108377597, 0.23472626340651012]
def recommend(movie):
    # row index of the requested title (raises IndexError if the title is not in df)
    movie_index = df[df['title'] == movie].index[0]
    distances = similarity[movie_index]
    # sort (index, score) pairs by score, highest first; position 0 is the movie itself, so skip it
    movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:16]
    mov = []
    ids = []
    scores = []
    for i in movies_list:
        mov.append(df.iloc[i[0]].title)
        ids.append(df.iloc[i[0]].movie_id)
        scores.append(i[1])
    dic = {'movie_id': ids, 'title': mov, 'Similarity Score': scores}
    return pd.DataFrame(dic)
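`recommend` looks the title up with an exact match, so a misspelled title raises an `IndexError`. One possible way to handle user input more gracefully is to suggest the closest known title first. The sketch below uses `difflib` from the standard library; the helper name `closest_title` and the short title list are assumptions made for illustration and are not part of the notebook.

```python
import difflib

def closest_title(user_input, known_titles, cutoff=0.6):
    """Return the known title most similar to user_input, or None if nothing is close enough."""
    # compare case-insensitively, but return the title in its original spelling
    lowered = {title.lower(): title for title in known_titles}
    matches = difflib.get_close_matches(user_input.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

# hypothetical stand-in for df['title'].tolist()
titles = ['Alice in Wonderland', 'Cars', 'Bolt', 'The Croods']

print(closest_title('alice in wonderlnd', titles))
print(closest_title('zzzz', titles))
```

The returned title could then be passed to `recommend`, with a `None` result prompting the user to try again.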
Visualization :¶#Histogram to represent the runtime of movies
#plt.hist(df['runtime'])
plt.hist(df['runtime'], bins=10)
plt.title('Run Time of movies', fontdict={'fontsize':14, 'color':'brown'})
plt.xlabel("Minutes")
plt.ylabel('Number of movies')
plt.show()
#Scatter plot to compare between Budget and Revenue
plt.scatter(df['budget'], df['revenue'], marker='o', c='blue')
plt.title('Budget vs Revenue', fontdict={'fontsize':14, 'color':'brown'})
plt.xlabel("Budget")
plt.ylabel('revenue')
plt.show()
#countplot to show the top 10 original languages of the movies
sns.set(rc = {'figure.figsize':(12,8)})  # set the figure size before drawing
sns.countplot(x='original_language',data=df,order=df['original_language'].value_counts().iloc[:10].index)
plt.title('Top languages of movies', fontdict={'fontsize':14, 'color':'brown'})
Text(0.5, 1.0, 'Top languages of movies')
language=df.original_language.value_counts()
#Pie chart to represent the top 10 languages
from IPython.display import HTML
import plotly.express as px
fig= px.pie(df,
            values=language.iloc[:10].values,
            names=language.iloc[:10].index,
            title='Top 10 Languages',
            height=1050,
            width=700)
HTML(fig.to_html())
#Bar chart of the number of movies at each average rating
fig=px.bar(df,
           x=df['vote_average'].value_counts().index,
           y=df['vote_average'].value_counts(),
           title='Overall Ratings',
           text=df['vote_average'].value_counts(),
           height=700
           )
HTML(fig.to_html())
#Pairwise scatter plots showing the relationship between every pair of numeric columns
import seaborn as sns
# Create the default pairplot
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x133242a1490>
all_countries=[]
for i in df['production_countries']:
    for j in i:
        all_countries.append(j)
#all_countries
from collections import Counter
cont = Counter()
for text in all_countries:
    cont[text] += 1
# list every country, most frequent first
cont.most_common()
[('United States of America', 3830),
('United Kingdom', 605),
('Germany', 314),
('France', 298),
('Canada', 253),
('Australia', 109),
('Spain', 69),
('Italy', 59),
('China', 58),
('Japan', 56),
('Hong Kong', 46),
('India', 38),
('Ireland', 37),
('Mexico', 29),
('Belgium', 25),
('Czech Republic', 24),
('New Zealand', 22),
('South Africa', 20),
('Denmark', 20),
('Switzerland', 18),
('South Korea', 18),
('Sweden', 18),
('Russia', 17),
('Netherlands', 17),
('United Arab Emirates', 14),
('Hungary', 13),
('Brazil', 13),
('Norway', 13),
('Romania', 11),
('Luxembourg', 11),
('Argentina', 9),
('Poland', 6),
('Iceland', 6),
('Israel', 6),
('Finland', 5),
('Austria', 5),
('Thailand', 5),
('Taiwan', 4),
('Morocco', 4),
('Bulgaria', 4),
('Iran', 4),
('Bahamas', 3),
('Malta', 3),
('Greece', 3),
('Jamaica', 2),
('Slovenia', 2),
('Pakistan', 2),
('Malaysia', 2),
('Peru', 2),
('Kazakhstan', 2),
('Chile', 2),
('Slovakia', 2),
('Colombia', 2),
('Dominica', 1),
('Monaco', 1),
('Tunisia', 1),
('Philippines', 1),
('Bosnia and Herzegovina', 1),
('Portugal', 1),
('Singapore', 1),
('Aruba', 1),
('Serbia', 1),
('Ukraine', 1),
('Panama', 1),
('Lithuania', 1),
('Cambodia', 1),
('Fiji', 1),
('Serbia and Montenegro', 1),
('Turkey', 1),
('Nigeria', 1),
('Cyprus', 1),
('Jordan', 1),
('Bolivia', 1),
('Ecuador', 1),
('Egypt', 1),
('Bhutan', 1),
('Lebanon', 1),
('Kyrgyz Republic', 1),
('Algeria', 1),
('Indonesia', 1),
('Guyana', 1),
('Guadaloupe', 1),
('Afghanistan', 1),
('Angola', 1),
('Dominican Republic', 1),
('Cameroon', 1),
('Kenya', 1)]
count_freq = pd.DataFrame(cont.most_common(10), columns=['countries', 'count'])
count_freq
| | countries | count |
|---|---|---|
| 0 | United States of America | 3830 |
| 1 | United Kingdom | 605 |
| 2 | Germany | 314 |
| 3 | France | 298 |
| 4 | Canada | 253 |
| 5 | Australia | 109 |
| 6 | Spain | 69 |
| 7 | Italy | 59 |
| 8 | China | 58 |
| 9 | Japan | 56 |
#Bar plot of the top 10 production countries
plt.figure(figsize=(15,8))
plt.title('Top 10 production countries of movies', fontdict={'fontsize':14, 'color':'brown'})
sns.barplot(x='count', y='countries', data=count_freq);
Recommendation¶How does the recommender function work?¶# we first get the index of the movie; for example, the index of Alice in Wonderland is 32
movie_index = df[df['title'] == 'Alice in Wonderland'].index[0]
movie_index
32
# we get the similarities for the movie
distances = similarity[32]
# we sort the similarities in descending order
sorted(distances,reverse=True)[0:10]
[1.0000000000000002, 0.3360537729058417, 0.2832856927186045, 0.2783349703706405, 0.26122949691608693, 0.2545454545454546, 0.24003840921845832, 0.23994948963429277, 0.2397457108377597, 0.23472626340651012]
#then we get the top 3 movies and their Index and similarity score
num=3
movies_list = sorted(list(enumerate(distances)),reverse=True, key=lambda x:x[1])[1:num+1]
movies_list
[(525, 0.3360537729058417), (885, 0.2832856927186045), (135, 0.2783349703706405)]
# finally we access them using Index
for i in movies_list:
    print(df.iloc[i[0]].title)
    print(df.iloc[i[0]].movie_id)
Cars
920
The Book of Life
228326
Kung Fu Panda 3
140300
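Sorting the whole row works, but only the first few entries are needed. As a sketch of an alternative, `heapq.nlargest` from the standard library picks the top matches directly. The similarity row below is made up, and `top_matches` is a hypothetical helper rather than part of the notebook.

```python
import heapq

def top_matches(scores, query_index, k):
    """Return the k (index, score) pairs with the highest score, excluding the query itself."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != query_index)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# made-up similarity row for a query movie stored at position 2
row = [0.10, 0.34, 1.00, 0.05, 0.28, 0.00]
print(top_matches(row, query_index=2, k=3))
```

Excluding the query by index, rather than slicing off the first sorted entry, also covers the case where another movie ties with the query at a score of 1.0.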
#generate random title from the list of movies
import random
def random_title():
    # choose from a plain list so the pick does not depend on the DataFrame index labels
    return random.choice(df['title'].tolist())
random_title()
'The Cookout'
movie_name='Alice in Wonderland'
popular_movies = recommend(movie_name)
popular_movies
| | movie_id | title | Similarity Score |
|---|---|---|---|
| 0 | 920 | Cars | 0.336054 |
| 1 | 228326 | The Book of Life | 0.283286 |
| 2 | 140300 | Kung Fu Panda 3 | 0.278335 |
| 3 | 332 | Inspector Gadget | 0.261229 |
| 4 | 13053 | Bolt | 0.254545 |
| 5 | 49519 | The Croods | 0.240038 |
| 6 | 10477 | Driven | 0.239949 |
| 7 | 17711 | The Adventures of Rocky & Bullwinkle | 0.239746 |
| 8 | 145220 | Muppets Most Wanted | 0.234726 |
| 9 | 20542 | Delgo | 0.234726 |
| 10 | 10996 | Stuart Little 2 | 0.230170 |
| 11 | 12703 | The Brown Bunny | 0.227260 |
| 12 | 62206 | 30 Minutes or Less | 0.224733 |
| 13 | 2270 | Stardust | 0.222475 |
| 14 | 788 | Mrs. Doubtfire | 0.220946 |
Bar plot of titles and similarity scores :¶import pandas as pd
plt.rcParams['figure.figsize'] = (10, 9)
sns.barplot(x=popular_movies['Similarity Score'], y=popular_movies['title'], palette='tab20')
plt.show()
import requests
from IPython.display import Image, HTML, display
def movie_display(popular_movies):
    getList_name = {}
    for x, xRows in popular_movies.iterrows():
        # use the movie id from the dataframe to look up the movie poster
        getResponse = requests.get('https://api.themoviedb.org/3/movie/{}?api_key=c0bda0be71f7815fd6ba2eb5f5c86fd8'.format(xRows['movie_id']) ) # every movie has a unique ID
        getData = getResponse.json() # parse the API response as JSON
        # some movies have no poster; skip them instead of raising an error
        if getData['poster_path'] is None:
            continue
        else:
            getPath = "http://image.tmdb.org/t/p/w500" + getData['poster_path'] # full URL of the poster
            getList_name[xRows['title']] = getPath
    display(HTML(f""" <div style="font-size:24px; font-weight:Bold; color:#fff; text-align:center; padding-top:8px; height:12%; width: 100%; border:1px solid #ccc; border-radius:10px; margin-top:10px; background-color:#FA1A1A;">{movie_name}</div> """))
    # step through the recommended movies five at a time (15 are returned by recommend());
    # each pass shows the posters and titles at positions i, i+2, i+3 and i+4
    for i in range(0,popular_movies.shape[0],5):
        display( HTML(f"""
<table>
<tr>
<td><img src={list(getList_name.values())[i]} style='border-radius:10px; height:400px; width:575px; border:1px solid #999;'></td>
<td><img src={list(getList_name.values())[i+2]} style='border-radius:10px; height:400px; width:575px; border:1px solid #999;'></td>
<td><img src={list(getList_name.values())[i+3]} style='border-radius:10px; height:400px; width:575px; border:1px solid #999;'></td>
<td><img src={list(getList_name.values())[i+4]} style='border-radius:10px; height:400px; width:575px; border:1px solid #999;'></td>
</tr>
<tr>
<td><div style="height:60px; padding-top:15px; text-align:center; font-size:14px; font-weight:bold; border:1px solid #ccc; border-radius:10px;">{list(getList_name.keys())[i+0]}</div></td>
<td><div style="height:60px; padding-top:15px; text-align:center; font-size:14px; font-weight:bold; border:1px solid #ccc; border-radius:10px;">{list(getList_name.keys())[i+2]}</div></td>
<td><div style="height:60px; padding-top:15px; text-align:center; font-size:14px; font-weight:bold; border:1px solid #ccc; border-radius:10px;">{list(getList_name.keys())[i+3]}</div></td>
<td><div style="height:60px; padding-top:15px; text-align:center; font-size:14px; font-weight:bold; border:1px solid #ccc; border-radius:10px;">{list(getList_name.keys())[i+4]}</div></td>
</tr>
</table>"""))
movie_display(recommend(movie_name))
![]() |
![]() |
![]() |
![]() |
Cars |
Kung Fu Panda 3 |
Inspector Gadget |
Bolt |
![]() |
![]() |
![]() |
![]() |
The Croods |
The Adventures of Rocky & Bullwinkle |
Muppets Most Wanted |
Delgo |
![]() |
![]() |
![]() |
![]() |
Stuart Little 2 |
30 Minutes or Less |
Stardust |
Mrs. Doubtfire |
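The grid above is assembled from fixed index offsets. A more general approach, sketched here under the assumption that the posters are available as a plain title-to-URL dictionary like `getList_name`, is to split the items into rows of a chosen width and build the table from those rows. The helper names and the example URLs are hypothetical.

```python
from html import escape

def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def poster_table(posters, per_row=4):
    """Build an HTML table with one row of images and one row of titles per chunk."""
    rows = []
    for group in chunk(list(posters.items()), per_row):
        images = ''.join(f'<td><img src="{escape(url)}" height="400"></td>' for _, url in group)
        titles = ''.join(f'<td>{escape(title)}</td>' for title, _ in group)
        rows.append(f'<tr>{images}</tr><tr>{titles}</tr>')
    return '<table>' + ''.join(rows) + '</table>'

# hypothetical poster URLs, standing in for getList_name
example = {f'Movie {n}': f'http://example.com/{n}.jpg' for n in range(1, 6)}
html_table = poster_table(example, per_row=4)
print(html_table.count('<img'))
```

In a notebook the result could be shown with `display(HTML(html_table))`. Because the last chunk may be shorter than the others, this version does not fail when a poster is missing and the count is not a multiple of the row width.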
Machine Learning Model¶In this section we clean the data further and then build a model that predicts the rating of a movie.
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from IPython.display import HTML
#we reload the files into new dataframes; this simply made it easier to collaborate
df_score=pd.read_csv('tmdb_5000_movies.csv')
df_score2=pd.read_csv('tmdb_5000_credits.csv')
df_score = df_score.merge(df_score2,on='title')
#the title acts as the key that aligns the rows of both files, much like a database foreign key
We referred to the correlation matrix at the start of the section. Note that we dropped budget and revenue even though they seem important and look like good predictors of a movie's success: the correlation matrix showed only a very weak relationship between either of them and vote_average.
df_score.drop(['vote_count', 'movie_id', 'runtime','homepage','budget','revenue','overview','release_date','spoken_languages','status','tagline','title','original_language'], axis=1, inplace=True)
df_score.head(2)
| | genres | id | keywords | original_title | popularity | production_companies | production_countries | vote_average | cast | crew |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":... | Avatar | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | [{"iso_3166_1": "US", "name": "United States o... | 7.2 | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | Pirates of the Caribbean: At World's End | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | [{"iso_3166_1": "US", "name": "United States o... | 6.9 | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |
import json
#a set that collects the index of every row we need to remove
empty_rows=set()
def json_to_list(col):
    # for crew we only extract the director
    if col == 'crew':
        # go through the entire dataframe
        for i in range(0, len(df_score)):
            film_director =""
            # parse the JSON text of the cell into a list of dictionaries
            df_list_ = json.loads(df_score.loc[i,col])
            for j in range(0,len(df_list_)):
                # keep only the entry whose job is Director
                if df_list_[j]['job'] == 'Director':
                    film_director = df_list_[j]['name']
                    df_score.loc[i,col] = str(film_director)
                    # stop at the first director found
                    break
            if film_director == "":
                # no director found: mark the row for removal
                empty_rows.add(i)
                df_score.loc[i,col] = ''
    elif col == 'cast':
        # loop through the entire dataframe
        for i in range(0, len(df_score)):
            # this list will hold the parsed names
            list_ = []
            # parse the JSON text of the cell into a list of dictionaries
            df_list_ = json.loads(df_score.loc[i,col])
            # extract the name from each dictionary
            for j in range(0, len(df_list_)):
                temp_ = df_list_[j]["name"]
                list_.append(temp_.lower())
            # if the cell is empty, mark the row for removal
            if len(df_list_) == 0:
                empty_rows.add(i)
            # we are only interested in the first 4 actors
            selected = list_[0:4]
            # store the result back in the row as the string form of the list
            df_score.loc[i,col] = str(selected)
    else:
        # loop through the entire dataframe
        for i in range(0, len(df_score)):
            list_ = []
            # parse the JSON text of the cell into a list of dictionaries
            df_list_ = json.loads(df_score.loc[i,col])
            for j in range(0, len(df_list_)):
                # select the part of the data we are interested in
                temp_ = df_list_[j]["name"]
                list_.append(temp_.lower())
            # if the cell is empty, mark the row for removal
            if len(df_list_) == 0:
                empty_rows.add(i)
            # store the result back in the row as the string form of the list
            df_score.loc[i,col] = str(list_)
# we convert the JSON text in each column to a list of names
json_to_list('genres')
json_to_list('keywords')
json_to_list('production_companies')
json_to_list('production_countries')
json_to_list('cast')
json_to_list('crew')
# we delete the rows in which any parsed column came out empty
empty_rows = list(empty_rows)
df_score.drop(empty_rows, axis=0, inplace=True)
#we need to reset the index of the dataframe
df_score.reset_index(level=None, drop=True, inplace=True, col_level=0)
df_score
| | genres | id | keywords | original_title | popularity | production_companies | production_countries | vote_average | cast | crew |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ['action', 'adventure', 'fantasy', 'science fi... | 19995 | ['culture clash', 'future', 'space war', 'spac... | Avatar | 150.437577 | ['ingenious film partners', 'twentieth century... | ['united states of america', 'united kingdom'] | 7.2 | ['sam worthington', 'zoe saldana', 'sigourney ... | James Cameron |
| 1 | ['adventure', 'fantasy', 'action'] | 285 | ['ocean', 'drug abuse', 'exotic island', 'east... | Pirates of the Caribbean: At World's End | 139.082615 | ['walt disney pictures', 'jerry bruckheimer fi... | ['united states of america'] | 6.9 | ['johnny depp', 'orlando bloom', 'keira knight... | Gore Verbinski |
| 2 | ['action', 'adventure', 'crime'] | 206647 | ['spy', 'based on novel', 'secret agent', 'seq... | Spectre | 107.376788 | ['columbia pictures', 'danjaq', 'b24'] | ['united kingdom', 'united states of america'] | 6.3 | ['daniel craig', 'christoph waltz', 'léa seydo... | Sam Mendes |
| 3 | ['action', 'crime', 'drama', 'thriller'] | 49026 | ['dc comics', 'crime fighter', 'terrorist', 's... | The Dark Knight Rises | 112.312950 | ['legendary pictures', 'warner bros.', 'dc ent... | ['united states of america'] | 7.6 | ['christian bale', 'michael caine', 'gary oldm... | Christopher Nolan |
| 4 | ['action', 'adventure', 'science fiction'] | 49529 | ['based on novel', 'mars', 'medallion', 'space... | John Carter | 43.926995 | ['walt disney pictures'] | ['united states of america'] | 6.1 | ['taylor kitsch', 'lynn collins', 'samantha mo... | Andrew Stanton |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4171 | ['drama'] | 124606 | ['gang', 'audition', 'police fake', 'homeless'... | Bang | 0.918116 | ['asylum films', 'fm entertainment', 'eagle ey... | ['united states of america'] | 6.0 | ['darling narita', 'peter greene', 'michael ne... | Ash Baron-Cohen |
| 4172 | ['science fiction', 'drama', 'thriller'] | 14337 | ['distrust', 'garage', 'identity crisis', 'tim... | Primer | 23.307949 | ['thinkfilm'] | ['united states of america'] | 6.9 | ['shane carruth', 'david sullivan', 'casey goo... | Shane Carruth |
| 4173 | ['action', 'crime', 'thriller'] | 9367 | ['united states–mexico barrier', 'legs', 'arms... | El Mariachi | 14.269792 | ['columbia pictures'] | ['mexico', 'united states of america'] | 6.6 | ['carlos gallardo', 'jaime de hoyos', 'peter m... | Robert Rodriguez |
| 4174 | ['comedy', 'drama', 'romance', 'tv movie'] | 231617 | ['date', 'love at first sight', 'narration', '... | Signed, Sealed, Delivered | 1.444476 | ['front street pictures', 'muse entertainment ... | ['united states of america'] | 7.0 | ['eric mabius', 'kristin booth', 'crystal lowe... | Scott Smith |
| 4175 | ['documentary'] | 25975 | ['obsession', 'camcorder', 'crush', 'dream girl'] | My Date with Drew | 1.929883 | ['rusty bear entertainment', 'lucky crow films'] | ['united states of america'] | 6.3 | ['drew barrymore', 'brian herzlinger', 'corey ... | Brian Herzlinger |
4176 rows × 10 columns
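To make the parsing step concrete, here is a small standalone sketch of how a single cell is converted. It applies `json.loads` to shortened sample strings in the same shape as the `genres` and `crew` columns shown earlier. The helper names and the non-director crew entry are illustrative assumptions.

```python
import json

def names_from_json(cell, limit=None):
    """Parse a JSON list of objects and return their lowercased 'name' values."""
    names = [entry['name'].lower() for entry in json.loads(cell)]
    return names if limit is None else names[:limit]

def director_from_json(cell):
    """Return the name of the first crew member whose job is 'Director', or '' if none."""
    for entry in json.loads(cell):
        if entry.get('job') == 'Director':
            return entry['name']
    return ''

genres_cell = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
crew_cell = '[{"job": "Editor", "name": "Some Editor"}, {"job": "Director", "name": "James Cameron"}]'

print(names_from_json(genres_cell))
print(director_from_json(crew_cell))
print(director_from_json('[]'))
```

An empty result (an empty list or an empty director string) corresponds to the rows that the notebook collects in `empty_rows` and drops.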
#function that converts a cell into a binary (0/1) presence vector over uniqueList
def binary(genre_list, uniqueList):
    #holds a 1 where an entry of uniqueList occurs in genre_list and a 0 where it does not
    binaryList = []
    for genre in uniqueList:
        if genre in genre_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    return binaryList
#function that collects the unique values found in a column (given by its position)
def unique_list(col):
    # we use a set so that repeats are not counted
    unique_set = set()
    for i in range(0, len(df_score)):
        #loop over the entire dataframe
        l = df_score.iloc[i,col]
        for j in range(len(l)):
            #add each element of the cell to the set; the cells hold the string form
            #written by json_to_list, so l[j] is a single character of that string
            unique_set.add(l[j])
    unique_list = list(unique_set)
    return unique_list
#converting all the cells that contain a list of strings to a list of binary to be able to use it in the score predictor
genreL = unique_list(0)
df_score['Genres_bin'] = df_score.iloc[:,0].apply(lambda x: binary(x,genreL))
castL = unique_list(8)
df_score['Actors_bin'] = df_score.iloc[:,8].apply(lambda x: binary(x,castL))
crewL = unique_list(9)
df_score['Directors_bin'] = df_score.iloc[:,9].apply(lambda x: binary(x,crewL))
keywordsL = unique_list(2)
df_score['Keywords_bin'] = df_score.iloc[:,2].apply(lambda x: binary(x,keywordsL))
df_score.head(3)
| | genres | id | keywords | original_title | popularity | production_companies | production_countries | vote_average | cast | crew | Genres_bin | Actors_bin | Directors_bin | Keywords_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ['action', 'adventure', 'fantasy', 'science fi... | 19995 | ['culture clash', 'future', 'space war', 'spac... | Avatar | 150.437577 | ['ingenious film partners', 'twentieth century... | ['united states of america', 'united kingdom'] | 7.2 | ['sam worthington', 'zoe saldana', 'sigourney ... | James Cameron | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, ... | [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, ... | [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, ... |
| 1 | ['adventure', 'fantasy', 'action'] | 285 | ['ocean', 'drug abuse', 'exotic island', 'east... | Pirates of the Caribbean: At World's End | 139.082615 | ['walt disney pictures', 'jerry bruckheimer fi... | ['united states of america'] | 6.9 | ['johnny depp', 'orlando bloom', 'keira knight... | Gore Verbinski | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, ... | [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, ... | [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ... | [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, ... |
| 2 | ['action', 'adventure', 'crime'] | 206647 | ['spy', 'based on novel', 'secret agent', 'seq... | Spectre | 107.376788 | ['columbia pictures', 'danjaq', 'b24'] | ['united kingdom', 'united states of america'] | 6.3 | ['daniel craig', 'christoph waltz', 'léa seydo... | Sam Mendes | [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ... | [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, ... | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, ... |
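Because `json_to_list` stores each list as a string, `unique_list` and `binary` above operate on the characters of that string. If one wanted a vector with one position per genre (or actor, or keyword) instead, the cells would first need to be parsed back into lists. The following is a sketch of that variant, not the encoding used for the results in this notebook; `multi_hot` and the sample cells are hypothetical.

```python
import ast

def multi_hot(items, vocabulary):
    """Return a 0/1 list with a 1 wherever the vocabulary entry occurs in items."""
    present = set(items)
    return [1 if token in present else 0 for token in vocabulary]

# sample cells in the string form stored in df_score['genres']
cells = ["['action', 'adventure', 'fantasy']", "['drama']", "['action', 'crime']"]

# parse the strings back into real lists before building the vocabulary
parsed = [ast.literal_eval(cell) for cell in cells]
vocabulary = sorted({token for row in parsed for token in row})

print(vocabulary)
for row in parsed:
    print(multi_hot(row, vocabulary))
```

Sorting the vocabulary keeps the vector positions stable between runs, whereas converting a set of strings to a list directly can yield a different order in each Python session.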
from scipy import spatial
#this function uses cosine distance (1 - cosine similarity) to measure how far apart two movies are
# the distance is computed on
#1) genre
#2) Actors
#3) Directors
#4) Keywords
def Similarity(movieId1, movieId2):
    a = df_score.iloc[movieId1]
    b = df_score.iloc[movieId2]
    genresA = a['Genres_bin']
    genresB = b['Genres_bin']
    genreDistance = spatial.distance.cosine(genresA, genresB)
    scoreA = a['Actors_bin']
    scoreB = b['Actors_bin']
    scoreDistance = spatial.distance.cosine(scoreA, scoreB)
    directA = a['Directors_bin']
    directB = b['Directors_bin']
    directDistance = spatial.distance.cosine(directA, directB)
    wordsA = a['Keywords_bin']
    wordsB = b['Keywords_bin']
    wordsDistance = spatial.distance.cosine(wordsA, wordsB)
    # each of the 4 cosine distances lies between 0 and 1 for these 0/1 vectors,
    # so the sum lies between 0 and 4 inclusive; a lower value means more similar movies
    return genreDistance + directDistance + scoreDistance + wordsDistance
Similarity(3,234) #let's check the distance between two arbitrary movies
0.9062687954957778
#This function computes the distance between a chosen movie and every other movie, finds the
# 10 closest (most similar) movies and returns their average rating
def score_prdictor(movieID):
    #create a dictionary with key: the index of the row, value: the distance to the chosen movie
    similarity_dic = dict()
    #loop through the dataframe to get the distances
    for i in range(0, len(df_score)):
        if i == movieID:
            #skip the chosen movie itself, because its distance to itself is 0
            continue
        temp_similarity = Similarity(movieID,i)
        similarity_dic[i]= temp_similarity
    #change the dictionary to a list of tuples
    similarity_list = list(similarity_dic.items())
    #sort the list by distance, smallest first, and keep the 10 nearest movies
    similarity_list = sorted(similarity_list, key = lambda x: x[1])[:10]
    #total_score accumulates the ratings so we can get the average vote prediction
    total_score = 0
    for i in similarity_list:
        temp = df_score.loc[int(i[0]), 'vote_average']
        total_score+= temp
    return round(total_score/10,2)
score_prdictor(4)
6.64
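`score_prdictor` is in effect a k-nearest-neighbours average: it ranks all other movies by distance and averages the ratings of the closest ten. A compact sketch of the same idea follows, on made-up distances and ratings with k=3 so the arithmetic is easy to follow. The function name `predict_rating` is an assumption.

```python
def predict_rating(distances, ratings, k):
    """Average the ratings of the k items with the smallest distance."""
    # pair each distance with its rating, nearest first
    nearest = sorted(zip(distances, ratings), key=lambda pair: pair[0])[:k]
    return round(sum(rating for _, rating in nearest) / len(nearest), 2)

# hypothetical distances from one movie to five others, and those movies' ratings
toy_distances = [0.9, 0.2, 1.5, 0.4, 3.1]
toy_ratings = [6.0, 7.0, 5.0, 8.0, 2.0]

print(predict_rating(toy_distances, toy_ratings, k=3))
```

Dividing by the number of neighbours actually found, rather than by a fixed constant, keeps the average correct when fewer than k other movies are available.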
# real will store the actual vote_average
#predicted will store the predicted vote_average using cosine similarity function explained above
real =[]
predicted =[]
#we will test on the first 100 rows
for i in range(0,100):
    real.append(df_score.loc[i,'vote_average'])
    predicted.append(score_prdictor(i))
# created a new dataframe to have the data of real and predicted
df_accuracy = pd.DataFrame()
df_accuracy['Real']=real
df_accuracy['Predicted']=predicted
# we will calculate the percentage error of each prediction
percentage_error =[]
for i in range(0, len(df_accuracy)):
    approx = df_accuracy.loc[i,'Predicted']
    exact = df_accuracy.loc[i,'Real']
    error = (abs(approx-exact)/exact)*100
    percentage_error.append(round(error,2))
df_accuracy["Percentage Error%"] = percentage_error
df_accuracy.head(3)
| | Real | Predicted | Percentage Error% |
|---|---|---|---|
| 0 | 7.2 | 7.06 | 1.94 |
| 1 | 6.9 | 6.59 | 4.49 |
| 2 | 6.3 | 6.74 | 6.98 |
Average_error = sum(percentage_error)/len(percentage_error)
print("Average Percentage Error: "+str(round(Average_error,2))+" %")
Average Percentage Error: 9.66 %
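As a check on the error column, the percentage error for the three rows shown above can be reproduced by hand. The sketch below applies the same formula, |predicted - real| / real × 100, to those three pairs; the function name `percentage_error_of` is an assumption.

```python
def percentage_error_of(real, predicted):
    """Absolute percentage error of a prediction, rounded to two decimals."""
    return round(abs(predicted - real) / real * 100, 2)

# (real, predicted) pairs taken from the first three rows of df_accuracy
pairs = [(7.2, 7.06), (6.9, 6.59), (6.3, 6.74)]
errors = [percentage_error_of(real, predicted) for real, predicted in pairs]
print(errors)

# mean over these three rows only, not over the 100 rows used for the figure above
print(round(sum(errors) / len(errors), 2))
```

Note that the formula divides by the real rating, so it would fail for a movie whose `vote_average` is 0.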
df_accuracy.plot( y=['Real','Predicted'], kind = 'line')
plt.show()
genres = pd.Series([categ for row in df['genres'] for categ in row])
#genres
#list of all genres in the column, with repetition
all_genres=[]
for i in genres:
    all_genres.append(i)
#all_genres
#get the distinct genres as a list, in order of first appearance
diff_genres = list( dict.fromkeys(all_genres) )
#print(diff_genres)
#cloud of words of all genres
import matplotlib.pyplot as plt
from wordcloud import WordCloud
my_list=diff_genres
#convert list to string and generate
unique_string=(" ").join(my_list)
wordcloud = WordCloud(width = 1000, height = 500).generate(unique_string)
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
plt.close()
Create a counter for the frequency table
from collections import Counter
cnt = Counter()
for text in all_genres:
    cnt[text] += 1
# list every genre, most frequent first
cnt.most_common()
[('Drama', 2172),
('Comedy', 1705),
('Thriller', 1248),
('Action', 1122),
('Romance', 865),
('Adventure', 763),
('Crime', 674),
('ScienceFiction', 525),
('Horror', 507),
('Family', 504),
('Fantasy', 410),
('Mystery', 335),
('Animation', 232),
('Music', 175),
('History', 152),
('War', 116),
('Documentary', 101),
('Western', 72),
('Foreign', 32),
('TVMovie', 5)]
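`Counter` can also be built directly from the list, and `most_common(n)` limits the result to the top n entries. A small sketch on a made-up genre list (`sample_genres` stands in for `all_genres`):

```python
from collections import Counter

# hypothetical flattened genre list, standing in for all_genres
sample_genres = ['Drama', 'Comedy', 'Drama', 'Action', 'Comedy', 'Drama', 'Thriller']

counts = Counter(sample_genres)   # counts every element in one pass
top_two = counts.most_common(2)   # list of (genre, count) pairs, highest count first
print(top_two)
```

The list of pairs returned by `most_common` has the shape that `pd.DataFrame(..., columns=[...])` expects, which is how the frequency tables in this notebook are built.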
Create a frequency table
#make a data frame of all words and their frequency
import pandas as pd
word_freq = pd.DataFrame(cnt.most_common(), columns=['words', 'count'])
word_freq
| | words | count |
|---|---|---|
| 0 | Drama | 2172 |
| 1 | Comedy | 1705 |
| 2 | Thriller | 1248 |
| 3 | Action | 1122 |
| 4 | Romance | 865 |
| 5 | Adventure | 763 |
| 6 | Crime | 674 |
| 7 | ScienceFiction | 525 |
| 8 | Horror | 507 |
| 9 | Family | 504 |
| 10 | Fantasy | 410 |
| 11 | Mystery | 335 |
| 12 | Animation | 232 |
| 13 | Music | 175 |
| 14 | History | 152 |
| 15 | War | 116 |
| 16 | Documentary | 101 |
| 17 | Western | 72 |
| 18 | Foreign | 32 |
| 19 | TVMovie | 5 |
Create the plot
p=sns.jointplot(x='count', y='words', data=word_freq, height=10);
p.fig.suptitle('Most common genres')
Text(0.5, 0.98, 'Most common genres')
plt.figure(figsize=(10,8))
sns.histplot(x='count', y='words', data=word_freq);
plt.suptitle("Frequency of genres")
Text(0.5, 0.98, 'Frequency of genres')
from cycler import cycler
import matplotlib as mpl
COLOR = 'crimson'
mpl.rcParams['text.color'] = COLOR
mpl.rcParams['axes.labelcolor'] = COLOR
mpl.rcParams['xtick.color'] = COLOR
mpl.rcParams['ytick.color'] = COLOR
plt.figure(figsize=(10,8))
sns.pointplot(x='count', y='words', data=word_freq,color="darkblue");
plt.suptitle("Frequency of genres")
Text(0.5, 0.98, 'Frequency of genres')
plt.figure(figsize=(10,8))
sns.barplot(x='count', y='words', data=word_freq);
plt.suptitle("Frequency of genres")
Text(0.5, 0.98, 'Frequency of genres')
As the visualizations show, the three most frequent genres are Drama, Comedy and Thriller.
#arranging the rows according to the popularity in descending order
import pandas as pd
p= pd.DataFrame(df.sort_values(by=['popularity'], ascending=False).head(10))
plt.figure(figsize=(10,10))
sns.pointplot(x='original_title', y='popularity', data=p);
plt.xticks(rotation=90)
plt.suptitle("Top 10 popular movies")
Text(0.5, 0.98, 'Top 10 popular movies')
plt.figure(figsize=(10,8))
sns.histplot(x='original_title', y='popularity', data=p);
plt.xticks(rotation=90)
plt.suptitle("Top 10 popular movies")
Text(0.5, 0.98, 'Top 10 popular movies')
plt.figure(figsize=(10,8))
sns.barplot(x='original_title', y='popularity', data=p);
plt.xticks(rotation=90)
plt.suptitle("Top 10 popular movies")
Text(0.5, 0.98, 'Top 10 popular movies')
As the visualizations show, Minions is the most popular movie.
#Dataframe of years and movies
year_df = pd.DataFrame(df['release_year'].value_counts().reset_index())
year_df.columns = ['year', 'movies']
year_df=pd.DataFrame(year_df.sort_values(by=['movies'], ascending=False).head(10))
plt.figure(figsize=(30,10))
sns.pointplot(x='year', y='movies', data=year_df);
plt.xticks(rotation=90)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), [Text(0, 0, '2004'), Text(1, 0, '2005'), Text(2, 0, '2006'), Text(3, 0, '2008'), Text(4, 0, '2009'), Text(5, 0, '2010'), Text(6, 0, '2011'), Text(7, 0, '2013'), Text(8, 0, '2014'), Text(9, 0, '2015')])
Answer: 2009 was the top year, with 241 movies.
#make actor dataframe that had four columns first_actor, second_actor, third_actor, and popularity
actor= pd.DataFrame(df_score['cast'])
#use the cast column and split it by commas
actor=actor['cast'].str.split(',', expand=True)
#there were extra empty columns
actor=actor.drop(columns=[3, 4])
#remove the leading "[" (a literal character, not a regular expression)
actor[0]=actor[0].str.replace("[","",regex=False)
actor.columns = ['first_actor', 'second_actor', 'third_actor']
popularity = df_score["popularity"]
actor = actor.join(popularity)
actor
| | first_actor | second_actor | third_actor | popularity |
|---|---|---|---|---|
| 0 | 'sam worthington' | 'zoe saldana' | 'sigourney weaver' | 150.437577 |
| 1 | 'johnny depp' | 'orlando bloom' | 'keira knightley' | 139.082615 |
| 2 | 'daniel craig' | 'christoph waltz' | 'léa seydoux' | 107.376788 |
| 3 | 'christian bale' | 'michael caine' | 'gary oldman' | 112.312950 |
| 4 | 'taylor kitsch' | 'lynn collins' | 'samantha morton' | 43.926995 |
| ... | ... | ... | ... | ... |
| 4171 | 'darling narita' | 'peter greene' | 'michael newland' | 0.918116 |
| 4172 | 'shane carruth' | 'david sullivan' | 'casey gooden' | 23.307949 |
| 4173 | 'carlos gallardo' | 'jaime de hoyos' | 'peter marquardt' | 14.269792 |
| 4174 | 'eric mabius' | 'kristin booth' | 'crystal lowe' | 1.444476 |
| 4175 | 'drew barrymore' | 'brian herzlinger' | 'corey feldman' | 1.929883 |
4176 rows × 4 columns
#group by each actor column and sum only the popularity
one=actor.groupby('first_actor', as_index=False)['popularity'].sum()
two=actor.groupby('second_actor', as_index=False)['popularity'].sum()
three=actor.groupby('third_actor', as_index=False)['popularity'].sum()
#give the two columns the same names in each dataframe
one.columns = ['actors', 'popularity']
two.columns = ['actors', 'popularity']
three.columns = ['actors', 'popularity']
#combine all data frames together to have all actors in same column and each one has the popularity of his movies
frames = [one, two,three]
result = pd.concat(frames)
result=result.groupby('actors', as_index=False).sum()
#get the top actors
top_actors= pd.DataFrame(result.sort_values(by=['popularity'], ascending=False).head(10))
plt.figure(figsize=(10,10))
sns.pointplot(x='actors', y='popularity', data=top_actors);
plt.xticks(rotation=90)
plt.suptitle("Top ten actors")
Text(0.5, 0.98, 'Top ten actors')
plt.figure(figsize=(10,8))
sns.histplot(x='actors', y='popularity', data=top_actors);
plt.xticks(rotation=90)
plt.suptitle("Top ten actors")
Text(0.5, 0.98, 'Top ten actors')
plt.figure(figsize=(10,8))
sns.barplot(x='actors', y='popularity', data=top_actors);
plt.xticks(rotation=90)
plt.suptitle("Top ten actors")
Text(0.5, 0.98, 'Top ten actors')
By total popularity of the movies they appear in (counting the first three billed actors), the top actor is "Johnny Depp".
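The per-actor totals can also be computed in one pass by reshaping the three actor columns into a single column. The following is a minimal sketch, not the notebook's own method: `actor_demo` is a made-up stand-in for the `actor` data frame, and it assumes the same column names (`first_actor`, `second_actor`, `third_actor`, `popularity`).

```python
import pandas as pd

# Toy stand-in for the notebook's `actor` frame (values are invented for illustration).
actor_demo = pd.DataFrame({
    "first_actor": ["a", "a", "b"],
    "second_actor": ["b", "d", "d"],
    "third_actor": ["c", "e", "f"],
    "popularity": [10.0, 5.0, 1.0],
})

# Reshape to one row per (movie, actor) pair, keeping the movie's popularity.
long_form = actor_demo.melt(
    id_vars="popularity",
    value_vars=["first_actor", "second_actor", "third_actor"],
    value_name="actors",
)

# Sum popularity per actor and sort from highest to lowest.
totals = (
    long_form.groupby("actors", as_index=False)["popularity"]
    .sum()
    .sort_values("popularity", ascending=False)
)
print(totals.head(10))
```

This should give the same totals as grouping each column separately and concatenating, with less repeated code.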
df_score_no=pd.read_csv('tmdb_5000_movies.csv')
df_score_no[df_score_no['budget'] == df_score_no['budget'].max()]
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 380000000 | [{"id": 12, "name": "Adventure"}, {"id": 28, "... | http://disney.go.com/pirates/index-on-stranger... | 1865 | [{"id": 658, "name": "sea"}, {"id": 1316, "nam... | en | Pirates of the Caribbean: On Stranger Tides | Captain Jack Sparrow crosses paths with a woma... | 135.413856 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | [{"iso_3166_1": "US", "name": "United States o... | 2011-05-14 | 1045713802 | 136.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Live Forever Or Die Trying. | Pirates of the Caribbean: On Stranger Tides | 6.4 | 4948 |
df_score_no[['id','title','budget', 'revenue']].sort_values(['budget'], ascending=False).head(10).style.background_gradient(subset=['budget', 'revenue'], cmap='PuBu')
| | id | title | budget | revenue |
|---|---|---|---|---|
| 17 | 1865 | Pirates of the Caribbean: On Stranger Tides | 380000000 | 1045713802 |
| 1 | 285 | Pirates of the Caribbean: At World's End | 300000000 | 961000000 |
| 7 | 99861 | Avengers: Age of Ultron | 280000000 | 1405403694 |
| 10 | 1452 | Superman Returns | 270000000 | 391081192 |
| 4 | 49529 | John Carter | 260000000 | 284139100 |
| 6 | 38757 | Tangled | 260000000 | 591794936 |
| 5 | 559 | Spider-Man 3 | 258000000 | 890871626 |
| 13 | 57201 | The Lone Ranger | 255000000 | 89289910 |
| 46 | 127585 | X-Men: Days of Future Past | 250000000 | 747862775 |
| 22 | 57158 | The Hobbit: The Desolation of Smaug | 250000000 | 958400000 |
"Pirates of the Caribbean: On Stranger Tides" had the highest budget (380,000,000).
df_score_no[df_score_no['runtime'] == df_score_no['runtime'].max()]
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2384 | 18000000 | [{"id": 80, "name": "Crime"}, {"id": 18, "name... | NaN | 43434 | [{"id": 1419, "name": "gun"}, {"id": 7336, "na... | en | Carlos | The story of Venezuelan revolutionary, Ilich R... | 1.138383 | [{"name": "Egoli Tossell Film AG", "id": 2254}... | [{"iso_3166_1": "FR", "name": "France"}, {"iso... | 2010-05-19 | 871279 | 338.0 | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... | Released | The man who hijacked the world | Carlos | 6.7 | 50 |
"Carlos" is the longest movie, with a runtime of 338 minutes.
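To read the answer off directly instead of scanning the wide row, the title and runtime of the longest movie can be selected with `idxmax`. This is a small sketch under stated assumptions: `movies_demo` is a made-up frame standing in for `df_score_no`, using only the `title` and `runtime` column names from the dataset.

```python
import pandas as pd

# Toy stand-in for df_score_no (titles and runtimes are invented for illustration).
movies_demo = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "runtime": [90.0, 338.0, None],  # a missing runtime is skipped by idxmax
})

# idxmax returns the index label of the largest runtime; .loc fetches that row.
longest = movies_demo.loc[movies_demo["runtime"].idxmax()]
summary = f'{longest["title"]} ({longest["runtime"]:.0f} min)'
print(summary)
```

Note that `idxmax` returns only the first row when several movies tie for the maximum, whereas the boolean filter used above returns all of them.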
# Parse release dates, then derive calendar features from them
df_score_no['release_date'] = pd.to_datetime(df_score_no['release_date'])
df_score_no['release_day'] = df_score_no['release_date'].dt.day
df_score_no['release_weekday'] = df_score_no['release_date'].dt.weekday
df_score_no['release_month'] = df_score_no['release_date'].dt.month
# Shift any year parsed as 2018 or later back by a century (presumably a guard against two-digit years read as 20xx)
year = df_score_no['release_date'].dt.year
df_score_no['release_year'] = year.where(year < 2018, year - 100)
plt.figure(figsize=(20,12))
# Pass the series as x explicitly; newer seaborn versions treat the first positional argument as data
sns.countplot(x=df_score_no['release_month'].sort_values(), palette="Dark2", edgecolor=(0,0,0))
plt.title("Movie Release Count by Month", fontsize=20)
plt.xlabel('Release Month')
plt.ylabel('Number of Movies Released')
plt.xticks(fontsize=12)
plt.show()